
Computational and Statistical Analysis of Metabolomics Data

Sheng Ren1,2, Anna A. Hinzman1,3, Emily L. Kang2, Rhonda D. Szczesniak6,7, L. Jason Lu1,3,4,5,7,*

2014.12.16 for Metabolomics

1. Division of Biomedical Informatics, Cincinnati Children’s Hospital Research Foundation,

3333 Burnet Avenue, Cincinnati, OH 45229-3026

2. Department of Mathematical Sciences, McMicken College of Arts & Sciences, University of

Cincinnati, 2815 Commons Way, Cincinnati, OH 45221-0025

3. Department of Biomedical Engineering, College of Medicine, University of Cincinnati, 231

Albert Sabin Way, Cincinnati, OH 45267-0524

4. Department of Environmental Health, College of Medicine, University of Cincinnati, 231

Albert Sabin Way, Cincinnati, OH 45267-0524

5. Department of Computer Science, College of Medicine, University of Cincinnati, 231 Albert

Sabin Way, Cincinnati, OH 45267-0524

6. Division of Pulmonary Medicine, Cincinnati Children’s Hospital Research Foundation, 3333

Burnet Avenue, Cincinnati, OH 45229-3026

7. Division of Biostatistics and Epidemiology, Cincinnati Children’s Hospital Research

Foundation, 3333 Burnet Avenue, Cincinnati, OH 45229-3026

* Corresponding author:

Long J. Lu, Ph.D., Associate Professor

Division of Biomedical Informatics, MLC 7024

Cincinnati Children’s Hospital Research Foundation

3333 Burnet Avenue

Cincinnati, OH 45229

Phone: (513) 636-8720

Fax: (513) 636-2056

Email: long.lu@cchmc.org

Website: http://dragon.cchmc.org


ABSTRACT

Metabolomics is the comprehensive study of small molecule metabolites in biological systems.

By assaying and analyzing thousands of metabolites in biological samples, it provides a whole

picture of metabolic status and biochemical events happening within an organism and has

become an increasingly powerful tool in disease research. In metabolomics, it is common to

deal with large amounts of data generated by nuclear magnetic resonance (NMR) and/or mass

spectrometry (MS). Moreover, based on different goals and designs of studies, it may be

necessary to use a variety of data analysis methods or a combination of them in order to obtain

an accurate and comprehensive result. In this review, we intend to provide an overview of

computational and statistical methods that are commonly applied to analyze metabolomics data.

The review is divided into four sections. The first section will introduce the background and the

databases and resources available for metabolomics research. The second section will briefly

describe the principles of the two main experimental methods that produce metabolomics data:

MS and NMR, followed by the third section that describes the preprocessing of the data from

these two approaches. In the fourth and the most important section, we will review four main

types of analysis that can be performed on metabolomics data with examples in metabolomics.

These are unsupervised learning methods, supervised learning methods, pathway analysis

methods and analysis of time course metabolomics data. We conclude by providing a Table

summarizing the principles and tools that we discussed in this review.

Key words: computational, statistical, unsupervised learning, supervised learning, pathway

analysis, time course data


INTRODUCTION

Omics is the study of the totality of biomolecules. Just as genomics is the analysis of a complete

genome, proteomics is the comprehensive analysis of proteins, and transcriptomics is the

comprehensive analysis of gene transcripts, metabolomics is the analysis of the complete set of

metabolites, or metabolome, in an organism (Griffin, Shockcor 2004; Oliver 2002). The

metabolome represents a large number of compounds, including amino acids, lipids,

organic acids, and nucleotides. Metabolites are used in or produced by chemical reactions, and

their levels can be regarded as the ultimate response of biological systems to genetic or

environmental changes. Therefore, it has been suggested that the metabolome is more sensitive

to systematic perturbations than the transcriptome and the proteome (Kell, Brown, Davey, Dunn,

Spasic, Oliver 2005).

Cellular processes involve specific metabolites for reactions. Studying and recording these

metabolites can lead to the discovery of biomarkers, which are measurable biological

characteristics that can be used to diagnose, monitor, or predict the risk of diseases (Xia,

Broadhurst, Wilson, Wishart 2012). There are several approaches to studying the metabolome,

including target analysis, metabolic profiling and metabolic fingerprinting (Griffin, Shockcor

2004). Target analysis focuses on the quantification of a small number of known metabolites.

Metabolic profiling focuses on a larger set of unknown metabolites. Metabolic fingerprinting

focuses on the extracellular metabolites. Rather than studying individual metabolites,

metabolomics collects quantitative data over a large range of metabolites to obtain an overall

understanding of the metabolism associated with a specific condition (Kaddurah-Daouk,

Krishnan 2009).


Discovering biomarkers through metabolomics will help diagnose and prevent diseases and

develop drugs for their treatment; such diseases include cancer (Griffin, Shockcor 2004), cardiovascular

diseases (Griffin, Atherton, Shockcor, Atzori 2011), central nervous system diseases (Kaddurah-

Daouk, Krishnan 2009), diabetes (Wang-Sattler, Yu, Herder, Messias, Floegel, He, Heim et al.

2012) and cystic fibrosis (Wetmore, Joseloff, Pilewski, Lee, Lawton, Mitchell, Milburn et al.

2010). Metabolomics can be a minimally invasive procedure since data can be gathered from

plasma, urine, cerebrospinal fluid (CSF), or tissue extracts. It has also been used in studying

plants to understand cellular processes and to decode the function of genes, in studying animals

to discover biomarkers, in foods research, and in herbal medicines (Putri, Nakayama, Matsuda,

Uchikata, Kobayashi, Matsubara, Fukusaki 2013).

The idea behind metabolomics has existed since people first used the sweetness

of urine to detect high glucose in diabetes. In the 1960s, chromatographic separation techniques

made it possible to detect individual metabolites. Robinson and Pauling’s “Quantitative Analysis

of Urine Vapor and Breath by Gas-Liquid Partition Chromatography”, written in 1971, was the

first scientific article about metabolomics (Pauling, Robinson, Teranishi, Cary 1971). The word

“metabolome” was coined by Oliver et al. in 1998 and defined as the set of metabolites

synthesized by an organism (Oliver, Winson, Kell, Baganz 1998). Nicholson et al. first used the

word metabonomics in a publication in 1999 to mean “the quantitative measurement of the

dynamic multiparametric metabolic response of living systems to pathophysiological stimuli or

genetic modification” (Nicholson, Lindon, Holmes 1999). Griffin, in his paper (Griffin,

Shockcor 2004), suggested that one of the best definitions of the metabolome, given by Oliver, is “the

complete set of metabolites/low-molecular-weight intermediates, which are context dependent,


varying according to the physiology, developmental or pathological state of the cell, tissue, organ

or organism”.

DATABASES AND RESOURCES FOR METABOLOMICS

In 2004, the Metabolomics Society was established to promote the growth, use and

understanding of metabolomics in the life sciences. The Metabolomics Society later launched a

journal, Metabolomics, published by Springer. The Society now has a Twitter feed

(@MetabolomicsSoc), which provides news from the Metabolomics Society, its annual

international conference, and the Metabolomics journal. METLIN, the first metabolomics

database, was also established in 2004. In 2005, the Human Metabolome Project was launched to

find and catalogue all of the metabolites in human tissue and biofluids. This metabolite

information is kept in the Human Metabolome Database, which produced its first draft in 2007

(Wishart, Jewison, Guo, Wilson, Knox, Liu, Djoumbou et al. 2013). In recent years the number

of papers written about metabolomics has been increasing. More than 800 papers were written in

2009, compared to fewer than 50 in 2002 (Griffiths, Koal, Wang, Kohl, Enot, Deigner 2010). As

technologies for the quantification and analysis of metabolomics are adapted and improved, the

use of metabolomics is expected to continue to grow.

There are many databases containing metabolomics data, and each has different

information, ranging from NMR and MS spectra to metabolic pathways. The purpose of

metabolic databases is to organize the many metabolites in a way that helps researchers easily

identify and analyze metabolomics data. The information found in metabolite databases has

continuously been updated in recent years as metabolomics studies have become more widely


conducted. Just as metabolomics is a new field and new approaches are still being discovered,

metabolomics databases are new and still improving. These databases contain various types of

information, including concentration, anatomical location, and related disorders. Among the

databases are the Human Metabolome Database (HMDB), MassBank, METLIN, lipid

metabolites and pathways strategy (LIPID MAPS), Madison metabolomics consortium database,

and Kyoto Encyclopedia of Genes and Genomes (KEGG).

HMDB contains detailed information for 40,444 metabolite entries with chemical, clinical,

and molecular biology/biochemistry data (Wishart, Jewison, Guo, Wilson, Knox, Liu, Djoumbou

et al. 2013). Each metabolite in the database includes a “Metabocard” with information including

molecular weights, spectra, associated diseases, and biochemical pathways. The purpose of

HMDB is to identify all of the metabolites in the human body. The 39,293 spectra in MassBank are

useful for the chemical identification and structure interpretation of chemical compounds

detected by mass spectrometry (Horai, Arita, Kanaya, Nihei, Ikeda, Suwa, Ojima et al. 2010).

METLIN is a repository of over 75,000 endogenous and exogenous metabolites from essentially

any living creature, including bacteria, plants and animals (Smith, O'Maille, Want, Qin, Trauger,

Brandon, Custodio et al. 2005). LIPID MAPS is not only the largest database of lipid molecular

structures, but also contains information on the lipid proteome, quantitative

estimates of lipids in the human plasma, the first complete map of the macrophage lipidome, and

a host of tools for lipid biology, including mass spectrometry tools, structure tools, and pathway

tools (Fahy, Sud, Cotter, Subramaniam 2007). The Madison metabolomics consortium database

is a resource for metabolomics research based on nuclear magnetic resonance (NMR)

spectroscopy and mass spectrometry (MS) (Cui, Lewis, Hegeman, Anderson, Li, Schulte,

Westler et al. 2008). The current total number of compounds in the Madison metabolomics


consortium database is 20,306. Finally, KEGG contains information about metabolic pathways

(Kanehisa 2002).

Metabolomics reporting and databases currently suffer from a lack of common language or

ontologies (Wishart 2007). The issue is further aggravated by the large number of different types

of instruments used in research, each of which has its own language. This makes working with

different instruments or other laboratories difficult. A possible solution (Wishart 2007) is

standardizing data by entering it into an electronic record-keeping system such as a laboratory information management system (LIMS). The

establishment of common reporting standards and data formats would make it much easier to

compare and locate metabolomics data.

EXPERIMENTAL METHODS

While tools for transcriptomics and proteomics have made significant improvements in recent

years, tools for metabolomics are still emerging. No analytical tool can measure all of the

metabolites in an organism, but nuclear magnetic resonance (NMR) spectroscopy and mass

spectrometry (MS) combined come the closest. That is, using NMR and MS together may result

in more complete data than using them individually. NMR spectroscopy and MS are the most

common technologies used to collect data from biofluids or tissues. NMR can be used to identify

and quantify metabolites from complex mixtures. NMR spectroscopy relies on certain nuclei that

possess a magnetic spin and when placed inside a magnetic field can adopt different energy

levels that can be observed using radiofrequency waves (Griffin, Atherton, Shockcor, Atzori

2011). Proton NMR (1H NMR) is the most commonly used for metabolomics. NMR approaches

can typically detect 20-40 metabolites in tissue and 50 in urine samples (Griffin, Shockcor 2004).


It is non-destructive because the sample does not come in contact with the detector, usually

occurs in a noninvasive manner, requires no chemical derivatization, and can be easily reproduced

(Armitage, Barbas 2014). A major advantage of NMR is that the signal intensities observed in

an NMR spectrum are directly proportional to the concentration of the nuclei in the sample

(Smolinska, Blanchet, Buydens, Wijmenga 2012). However, compared to MS, it has lower

sensitivity; only metabolites of medium to high abundance will be detected (Smolinska, Blanchet, Buydens,

Wijmenga 2012).

Mass spectrometry-based metabolomics is more commonly used than NMR, judging by the

number of publications annually that use each technique (Dettmer, Aronov, Hammock 2007). In

order to separate the components of a mixture, MS is typically coupled to other separation techniques.

Among all hyphenated MS methods, gas chromatography MS (GC-MS) and liquid

chromatography MS (LC-MS) are most popular, as they can be used to detect low-concentration

metabolites. GC-MS can be applied to the analysis of low molecular weight metabolites, and it is

highly sensitive, quantitative and reproducible (Armitage, Barbas 2014). GC-MS is also

preferred in terms of cost and operational issues (Theodoridis, Gika, Want, Wilson 2012). It can

typically detect 1,000 metabolites (Griffin, Shockcor 2004). LC-MS, which can also be prefixed

with high (HPLC) or ultra-high (UPLC) performance, is suitable for the analysis of non-volatile

chemicals and is therefore complementary to GC-MS (Armitage, Barbas 2014). It has high

sensitivity and is less time consuming than GC-MS, but it can be more expensive (Griffin,

Shockcor 2004). One advantage of LC-MS is that it can separate and detect a wide range of

molecules and allows for the collection of both quantitative and structural information

(Theodoridis, Gika, Want, Wilson 2012).


In addition to the three main stream methods mentioned above, there are also other

important spectroscopy and hyphenated methods. Among all spectroscopy methods, vibrational

spectroscopy is one of the oldest (Li, Wang, Nie, Zhang 2012). There are primarily two

vibrational methods utilized: Fourier-transform infrared spectrometry (FT-IR) and Raman

spectroscopy (RS). FT-IR is inexpensive and good for high-throughput screening but it is very

poor at distinguishing metabolites within a class of compounds (Griffin, Shockcor 2004) and

much less sensitive compared to MS (Patel, Patel, Patel, Rajput, Patel 2010). Moreover, even

when HPLC is combined with FTIR, the resulting HPLC-FTIR method may have the disadvantage of

yielding a low level of detailed molecular identifications (Nin, Izquierdo-García, Lorente 2012)

and the progress in this hyphenated technique is slow (Patel, Patel, Patel, Rajput, Patel 2010).

Raman spectroscopy is an extension of FT-IR and it has been used for the identification of

microorganisms of medical relevance (Dunn, Bailey, Johnson 2005). However, although there

are some advantages of RS over FT-IR, it suffers from similar problems to FT-IR (Griffin, Shockcor

2004). Capillary electrophoresis mass spectrometry (CE-MS) is a powerful separation technique

for charged metabolites (Dettmer, Aronov, Hammock 2007) and has been predominantly used in

targeted metabolomics (Gika, Theodoridis, Plumb, Wilson 2014). However, since the analytical

system stability is not as high as in GC- or LC-MS, it is not yet widely applied in global

metabolite profiling (Theodoridis, Gika, Want, Wilson 2012).

The experimental design is an important aspect to consider before conducting any

metabolomics experiments. It is a plan of data-gathering studies, which is constructed to control

process variation in the experiments and to ensure potential confounders are not present or are

well-characterized (Dunn, Wilson, Nicholls, Broadhurst 2012). The process variation in the

experiments can be introduced in the sample collection, storage and preparation steps. For


example, sample collection time of day (Slupsky, Rankin, Wagner, Fu, Chang, Weljie, Saude et

al. 2007), storage and experiment temperature (Cao, Dong, Cai, Chen 2008; Lauridsen, Hansen,

Jaroszewski, Cornett 2007) all have an impact on the metabolite profile determined. These

conditions and procedures, if not standardized, may lead to spurious biomarkers being reported

and may account for a lack of reproducibility between laboratories (Emwas, Luchinat, Turano,

Tenori, Roy, Salek, Ryan et al. 2014). Therefore, a standard operating procedure is essential to

control variation introduced during the sample preparation process. There are published sample-handling

procedures for NMR (Emwas, Luchinat, Turano, Tenori, Roy, Salek, Ryan et al. 2014) and MS

(Dunn, Broadhurst, Begley, Zelena, Francis-McIntyre, Anderson, Brown et al. 2011) studies.

Controlling for potential confounding factors is also critical and is better addressed in

experimental design (Broadhurst, Kell 2006). In metabolomics research, the confounding factors

are those variables that correlate with both response variables (e.g. disease status) and

metabolites concentrations. Such factors include but are not limited to age, gender (Emwas,

Luchinat, Turano, Tenori, Roy, Salek, Ryan et al. 2014), diet (Heinzmann, Brown, Chan,

Bictash, Dumas, Kochhar, Stamler et al. 2010), physical activity (Enea, Seguin, Petitpas-Mulliez,

Boildieu, Boisseau, Delpech, Diaz et al. 2010) and individual metabolic phenotypes (Assfalg,

Bertini, Colangiuli, Luchinat, Schäfer, Schütz, Spraul 2008). Those factors, if not properly

controlled, could lead to failure to discover true significance or to the reporting of spurious findings

(Dunn, Wilson, Nicholls, Broadhurst 2012). For human studies, we should control for

confounders first by defining more specific criteria in selecting subjects, since subjects are

heterogeneous with respect to demographic and lifestyle factors. This is especially important in

defining healthy controls (Scalbert, Brennan, Fiehn, Hankemeier, Kristal, van Ommen, Pujos-

Guillot et al. 2009). Then, it is recommended to perform sample randomization in order to reduce


the correlation between confounders and sample analysis order and instrument conditions (Dunn,

Wilson, Nicholls, Broadhurst 2012). More advanced statistical experimental design methods, for

example nested stratified proportional randomization and matched case-control design, can be

used when outcomes (e.g. disease status) are imbalanced (Dunn, Wilson, Nicholls, Broadhurst

2012; Xia, Broadhurst, Wilson, Wishart 2012). For animal studies, where many confounding

factors can be well controlled, large numbers of samples are not needed compared to human

studies (Emwas, Luchinat, Turano, Tenori, Roy, Salek, Ryan et al. 2014). In fact, by using some

statistical experimental design methods, such as factorial design and randomized block design,

researchers can minimize the number of samples used and control most confounders at the same

time (Kilkenny, Parsons, Kadyszewski, Festing, Cuthill, Fry, Hutton et al. 2009). There is a rich

literature discussing experimental design in both statistical methodology (Box, Hunter, Hunter

1978; Montgomery 2008) and its applications in high throughput biological assays (Riter, Vitek,

Gooding, Hodge, Julian 2005; Rocke 2004).
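To make the randomization step above concrete, the following Python sketch (hypothetical sample labels, not taken from the cited studies) randomizes the instrument run order of a small case-control set so that analysis order does not systematically track group membership:

```python
import random

# Hypothetical example: 10 case and 10 control samples awaiting analysis.
samples = [("case", i) for i in range(10)] + [("control", i) for i in range(10)]

random.seed(42)          # fixed seed so the run order is reproducible
run_order = samples[:]   # copy the list, then shuffle it in place
random.shuffle(run_order)

# Inspect the first few positions of the randomized run order.
print(run_order[:4])
```

In a real study the randomization would typically be stratified by batch or by confounder (e.g. age group) rather than fully random, as discussed above.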

DATA PREPROCESSING

Data preprocessing plays an important role and can substantially affect subsequent statistical

analysis results. It takes place after the raw spectra are collected and serves as the link between

raw data and statistical analysis. NMR and MS spectra typically show differences in peak shape,

width, and position due to noise, sample differences or instrument factors (Blekherman,

Laubenbacher, Cortes, Mendes, Torti, Akman, Torti et al. 2011; Smolinska, Blanchet, Buydens,

Wijmenga 2012). The goal of preprocessing is to correct those differences for better

quantification of metabolites and improved comparability between different samples. Similar


preprocessing considerations and methods can be applied to both MS and NMR (Vettukattil

2015).

Preprocessing for NMR typically includes baseline correction, alignment, binning,

normalization, and scaling (Smolinska, Blanchet, Buydens, Wijmenga 2012). Baseline correction

is a procedure to correct the distortion in the baseline caused by systematic artifacts. It is very

important since signal intensities are calculated with reference to the baseline (Vettukattil 2015).

Current automatic baseline correction methods are mostly based on polynomial fitting such as

locally weighted scatterplot smoothing (Xi, Rocke 2008) and splines (Eilers, Marx 1996). After

baseline correction, some unwanted spectral regions, such as the water region and other

contaminant signals, are often removed (Vettukattil 2015). Due to differences in instrumental factors, salt concentrations,

temperature and changes of pH, peak shifts can often be observed between samples.

Therefore, alignment must be performed in order to correct those shifts. Since most shifts in

NMR are local shifts, it is often insufficient to simply perform global alignment by spectral

referencing (Smolinska, Blanchet, Buydens, Wijmenga 2012). Several automatic methods like

icoshift (Savorani, Tomasi, Engelsen 2010) and correlation optimized warping (Tomasi, van den

Berg, Andersson 2004) can be used to perform local alignment. After automatic baseline

correction or alignment, it is recommended to visually inspect the processed spectra and one can

also choose to manually correct baseline and perform alignment (Vettukattil 2015). Binning (also

known as bucketing) is a dimension reduction technique, which divides the spectra into segments

and replaces the data values within each bin by a representative value. It is a useful technique

when perfect alignment is hard to achieve (Smolinska, Blanchet, Buydens, Wijmenga 2012).

Traditional equal-sized binning is not recommended since peaks can be split across two bins. Some

adaptive binning methods such as Gaussian binning (Anderson, Reo, DelRaso, Doom, Raymer


2008) and adaptive binning using wavelet transform (Davis, Charlton, Godward, Jones, Harrison,

Wilson 2007) can overcome this difficulty to some extent. However, binning can reduce spectral

resolution; therefore, it may be better to avoid binning when spectral misalignment is not serious

or when identification of metabolites is more important (Vettukattil 2015). Normalization can

remove or correct for some systematic variations between samples, for example sample dilution

factors, which is a key factor in analysis of urinary metabolites (Smolinska, Blanchet, Buydens,

Wijmenga 2012), in order to make samples more comparable with each other. Typically,

normalization is a multiplication of every row (sample) by a sample-specific constant (Craig,

Cloarec, Holmes, Nicholson, Lindon 2006). One popular normalization technique is total integral

normalization, where the total spectral intensity of each sample is the constant. When some of

the strong signals change considerably between samples, probabilistic quotient normalization can

offer more robust results than total integral normalization (Dieterle, Ross, Schlotterbeck, Senn

2006). Scaling, in metabolomics data analysis, often refers to the column operations that are

performed on each feature (spectral intensity or metabolite concentration) across all samples in

order to make the features more comparable. Scaling can affect the results of subsequent

statistical analysis, and we briefly discuss this issue in the Principal Component Analysis section

below. Commonly used scaling methods include but are not limited to autoscaling, Pareto scaling

and range scaling. More detailed discussion on these methods can be found in (van den Berg,

Hoefsloot, Westerhuis, Smilde, van der Werf 2006) and (Timmerman, Hoefsloot, Smilde,

Ceulemans 2015).
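As an illustration of the binning, normalization and scaling steps described above, the following Python sketch applies equal-sized binning, total integral normalization, probabilistic quotient normalization and Pareto scaling to a simulated spectral matrix (equal-sized binning is used here only for simplicity, and the data and dimensions are hypothetical):

```python
import numpy as np

# Toy spectra matrix: rows = samples, columns = spectral points (simulated).
rng = np.random.default_rng(0)
X = np.abs(rng.normal(loc=10.0, scale=2.0, size=(5, 100)))

# Equal-sized binning: average every 10 adjacent points into one bin.
bins = X.reshape(X.shape[0], -1, 10).mean(axis=2)          # shape (5, 10)

# Total integral normalization: divide each sample (row) by its total intensity.
tin = bins / bins.sum(axis=1, keepdims=True)

# Probabilistic quotient normalization (Dieterle et al. 2006): divide each
# sample by the median quotient against a reference (median) spectrum.
reference = np.median(tin, axis=0)
quotients = tin / reference
pqn = tin / np.median(quotients, axis=1, keepdims=True)

# Pareto scaling: center each feature (column) and divide by the square
# root of its standard deviation.
pareto = (pqn - pqn.mean(axis=0)) / np.sqrt(pqn.std(axis=0))
```

Note that normalization operates on rows (samples) while scaling operates on columns (features), mirroring the distinction drawn in the text.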

Data preprocessing for MS typically includes noise filtering, baseline correction,

normalization, peak alignment, peak detection, peak quantification and spectral deconvolution.

One should note that not all methods use all of the processing steps listed above, nor do they


necessarily perform them in the same order (Coombes, Tsavachidis, Morris, Baggerly, Hung,

Kuerer 2005). Although many preprocessing steps of MS are similar to NMR, there are still

some differences. First, a noise filtering step is often associated with MS data preprocessing to

improve peak detection (Blekherman, Laubenbacher, Cortes, Mendes, Torti, Akman, Torti et al.

2011). There are many different noise filters, such as the Savitzky-Golay filter, Gaussian filters and

wavelet-based filters; wavelet-based methods provide the best average performance (Yang,

He, Yu 2009) due to their adaptive, multi-scale nature (Coombes, Tsavachidis, Morris, Baggerly,

Hung, Kuerer 2005). Second, a de-isotoping step, which is specific to MS data, can be used to

cluster the isotopic peaks corresponding to the same compounds together to simplify the data

matrix (Vettukattil 2015). Third, deconvolution is an important step to separate overlapping

peaks in order to improve peak quantification. However, deconvolution also has the potential to

introduce errors and extra variability to the process (Coombes, Tsavachidis, Morris, Baggerly,

Hung, Kuerer 2005). There are many software tools available for NMR and MS data

preprocessing; a comprehensive summary of these tools can be found in (Vettukattil 2015).
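As a minimal sketch of the noise filtering step, the following Python code smooths a simulated spectrum with a simple Gaussian kernel; this is a stand-in chosen for brevity, since practical pipelines typically use Savitzky-Golay or wavelet-based filters as discussed above:

```python
import numpy as np

# Gaussian smoothing filter: convolve the signal with a normalized kernel.
def gaussian_filter(signal, sigma=2.0, radius=6):
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()                    # preserve total intensity
    return np.convolve(signal, kernel, mode="same")

# Toy spectrum: two narrow peaks plus random noise (simulated data).
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 500)
clean = np.exp(-((t - 0.3) / 0.01) ** 2) + 0.5 * np.exp(-((t - 0.7) / 0.01) ** 2)
noisy = clean + rng.normal(scale=0.05, size=t.size)

smoothed = gaussian_filter(noisy)
# Smoothing reduces point-to-point noise, at the cost of slightly
# broadening the peaks, which is why filter choice matters for peak detection.
```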

OVERVIEW OF METABOLOMICS DATA ANALYSIS

There are two different approaches to processing metabolomics data: chemometrics and

quantitative metabolomics. For the former, we directly perform statistical analysis on spectral

patterns and signal intensity data and identify metabolites in the last step if needed. For the latter,

we identify all metabolites first and then analyze the metabolite data directly. Compared to

quantitative metabolomics, the key advantage of chemometric profiling is its ability to perform

automated and unbiased assessment of metabolite data. However, it requires a large number of

spectra and strict sample uniformity, which are less of a concern in quantitative metabolomics.


Therefore, quantitative metabolomics is more amenable to human studies or studies that require

less day-to-day monitoring (Matthiesen 2010). However, the data

analysis methods behind them are similar. In this section, we will discuss four different types of

data analysis methods. Note that these methods are not totally independent; they differ only by

serving different research purposes. Within each type of data analysis method, we select the most

basic, important and widely used models or methods based on published research and review

papers we found in metabolomics. The methods we selected cover most core methods currently

in use on metabolomics data analysis platforms, such as MetaboAnalyst (Xia, Mandal,

Sinelnikov, Broadhurst, Wishart 2012; Xia, Psychogios, Young, Wishart 2009). We also

included methods beyond the scope of such platforms. We give a brief introduction to the

background, models and algorithms, important facts and potential limitations for each method

that we discuss in detail, together with important references and illustrative examples. For the

methods not discussed in great detail, we list a few key references. At the end, we briefly

summarize all methods discussed in Table 2 in order to offer readers a clear overview of these

methods.

Unsupervised Learning Methods

When we receive the data after pre-processing, we may wish to obtain a general idea of its

structure. Unsupervised learning methods allow us to discover the groups or trends in the data.

The word “unsupervised” here implies that the data we analyze is unlabeled with class

membership. The purpose of unsupervised learning is to summarize, explore and discover.

Therefore we may only need a few prior assumptions and a little prior knowledge of the data.


Unsupervised learning is usually the first step in data analysis and can help visualize the data or

verify any unintended issues with the experimental design. Among the many different unsupervised learning

methods, we will discuss four of the most commonly used methods in metabolomics data

analysis.

1. Principal Component Analysis (PCA)

If we have a high-dimensional dataset, e.g., dozens or hundreds of metabolites, peak locations, or

spectral bins for each subject, we may wish to find only a few combinations that best explain the

total variation in the original dataset. PCA is one of the most powerful methods to perform this

type of dimension reduction (Jolliffe 2005). The main objective of the PCA algorithm is to

replace all correlated variables by a much smaller number of uncorrelated variables, often

referred to as principal components (PCs), that still retain most of the information in the original

dataset (Jolliffe 2005). Although the number of PCs equals the number of variables, only a

limited number of PCs are interpreted. Moreover, if the first few PCs can explain a large

proportion of variation in the data, we can visualize the data using a 2-dimensional or 3-

dimensional plot (sometimes called scores and loadings plots) (Fig. 1).

Before we perform PCA, it is recommended to standardize the variables (Jolliffe 2005).

This process consists of centering each of the vectors and standardizing each of them to have

variance equal to 1. By doing so, we actually perform PCA on the sample correlation matrix instead

of the sample covariance matrix. When the variables have widely differing variances or use different

units of measurement, the variables with the largest variances may dominate the first few PCs

(Jolliffe 2005). If those variances are biologically meaningful, we do not need to standardize all


variables; otherwise it is highly recommended to perform standardization before performing

PCA (Johnson, Wichern 2007). We can calculate the sample PC scores matrix (denoted by Y)

of all subjects (denoted by X, which is the original dataset with n subjects and p variables).

The process can be expressed by the formula Y = XW, where W = (w_1, ..., w_p) is the

weight matrix (i.e., loading matrix), x_i is the i-th row of X, which represents the values of the i-th

subject, w_j is the j-th column of W, which represents the weights of all original variables on the

j-th PC, and y_ij = x_i w_j is called the score of the i-th subject on the j-th PC. All of the PCs are uncorrelated

and their variances are the eigenvalues of the sample correlation matrix (or sample covariance matrix if not

standardized). The largest eigenvalue corresponds to the first PC, the second largest eigenvalue

corresponds to the second PC, and so on. In order to measure the contribution of a PC to the total

sample variance, we use its corresponding eigenvalue divided by the sum of all

eigenvalues as a measure. This is the percentage of variance explained by the corresponding PC.

There is no rule for how many PCs to keep; we usually make the decision by checking the

“variance explained” measure mentioned above or using a scree plot (Johnson, Wichern 2007).
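The PCA procedure described above (standardization, eigendecomposition of the sample correlation matrix, scores, and variance explained) can be sketched in a few lines of Python on simulated data (the dataset and its dimensions are hypothetical):

```python
import numpy as np

# Simulated data: 50 subjects, 6 variables, two of them strongly correlated.
rng = np.random.default_rng(0)
n, p = 50, 6
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)

# Standardize: center each variable and scale to unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the sample correlation matrix.
R = np.corrcoef(Z, rowvar=False)
eigvals, W = np.linalg.eigh(R)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]     # reorder so PC1 has the largest eigenvalue
eigvals, W = eigvals[order], W[:, order]

scores = Z @ W                        # PC scores of all subjects
explained = eigvals / eigvals.sum()   # proportion of variance per PC
```

Because the correlated pair concentrates variance in the first component, `explained[0]` dominates; plotting `explained` against the component index gives the scree plot mentioned above.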

Here we use a simple example of microbial metabolomics (Hou, Braun, Michel, Klassen,

Adnani, Wyche, Bugni 2012) to illustrate the basic idea of how to use PCA. Other examples of

how PCA was used in metabolomics studies can be found in (Heather, Wang, West, Griffin

2013) and (Ramadan, Jacobs, Grigorov, Kochhar 2006). In this study, the authors used PCA to

perform strain selection and discover unique natural products. They assumed bacterial strains

producing the same secondary metabolites would group together. The PCA scores and loadings

plots of 47 strains are shown in Fig. 1.

In PCA, the scores plot is mainly used to discover groups while the loading plot is mainly

used to find variables that are responsible for separating the groups. In the loading plot, we

18

mainly check the points that are further from the origin than most other points in the plot. For the

scores plot (Fig. 1a), we can see seven identifiable groups; these groups were identified by visual inspection. In the loading plot (Fig. 1b), point 1 corresponds to a compound that is responsible

for separating group G7 from the other groups. We cannot judge which of the points in the

loading plot are responsible for separating subjects into groups using only these two plots;

instead, we should go back to the loading matrix W to check the weights. Furthermore, the

groups shown in the PCA scores plots are not necessarily the biologically meaningful groups.

PCA often provides a clue for further investigation. Although the authors mentioned 74 PCs that

were generated, which explained 98% of the variation in the data set, they did not show how

much of the variance was explained by the first two PCs. Note that PCA may not be powerful if

the first few PCs cannot explain a large proportion of variability of the sample. For example, if

the first two PCs in the plot account for only 50% of the total variation, then the visualization

results may be misleading, so we cannot identify the groups of the strains only by graphs.

2. Clustering

Unlike PCA, clustering analysis explicitly aims at identifying groups in the original dataset. All

clustering algorithms group the subjects such that the subjects in the same group or cluster are

more similar to each other than to subjects in other groups. Different algorithms may use

different similarity measures, such as various distances and correlation coefficients. Among the

many different clustering methods, we will only introduce the two most common methods in metabolomics as well as in many other areas of data analysis.

19

(1) K-means Clustering

K-means clustering is centroid-based clustering and is a type of partitioned clustering method

(Hartigan, Wong 1979). Here centroid-based indicates that each cluster can be represented by a

center vector, which may not be an observation in the original dataset. Partitioning

requires each subject to appear in exactly one cluster. K-means clustering divides subjects into k non-overlapping clusters such that each subject belongs to the cluster with the nearest mean. If all of the variables are numerical, we generally choose Euclidean distance as the metric between a subject and a center vector. When using Euclidean

distance fails to find meaningful clusters, we may consider using other distance metrics, for

example the Mahalanobis distance, which has the form d_A(x, y) = ((x − y)^T A (x − y))^(1/2). It is clear that Euclidean distance is a special case of Mahalanobis distance when A is the identity matrix. In the classical Mahalanobis distance, A is the inverse of a covariance matrix; when A is unknown, a general and efficient algorithm (Xing,

Jordan, Russell, Ng 2002) can be used to learn the parameters in A together with performing K-

means clustering. Variations of the K-means clustering algorithm include using median instead

of mean as the center vector and assigning weights to each variable. There are also some

drawbacks to K-means clustering methods. The major problem is that the number of clusters,

“K”, is an unknown parameter, and thus we must determine K before we employ the algorithm.

Visualization tools such as PCA, multidimensional scaling (MDS) and self-organizing map

(SOM) may help to determine K. There are also some statistical methods for estimating K; the most widely used are the gap statistic (Tibshirani, Walther, Hastie 2001) and the weighted gap statistic (Yan, Ye 2007). Another problem is that K-means assumes that subjects in each cluster

are distributed spherically around the center (Hamerly, Elkan 2003). This assumption may lead

to poor performance on data with outliers or with clusters of various sizes or non-globular shapes


(Ertöz, Steinbach, Kumar 2003). An adaptive Fuzzy c-means clustering (Gunderson 1982;

Gunderson 1983) can be used in these cases. Fuzzy c-means clustering (Bezdek, Coray,

Gunderson, Watson 1981; Dunn 1973) is an extension of K-means where each data point

belongs to multiple clusters to a certain degree, called the membership value. The adaptive

Fuzzy c-varieties clustering algorithm (Gunderson 1983), which is based on (Gunderson 1982), is a data-dependent approach that can seek out cluster shapes and detect a mixture of clusters of

different shapes. Therefore, it removes the limitation of imposing non-representative structures

in K-means and Fuzzy c-means clustering. An alternative way to solve the arbitrary cluster shape problem is using kernel K-means (Schölkopf, Smola, Müller 1998), as suggested in

(Jain 2010). Another limitation of K-means and other clustering methods is that some variables

may hardly reflect the underlying clustering structure (Timmerman, Ceulemans, Kiers, Vichi

2010). One possible way to solve that problem is performing K-means in reduced space (De

Soete, Carroll 1994). Many methods have been proposed to improve the original reduced K-

means (De Soete, Carroll 1994), including factorial K-means (Vichi, Kiers 2001) and subspace

K-means (Timmerman, Ceulemans, De Roover, Van Leeuwen 2013). An alternative solution is

using variable selection (Steinley, Brusco 2008) or variable weighting (Huang, Ng, Rong, Li

2005). An illustrative example of using K-means clustering on metabolites profiles to explore

dietary intake patterns can be found in (O'Sullivan, Gibney, Brennan 2011). More details on how to use fuzzy c-means in metabolomics are given in (Li, Lu, Tian, Gao, Kong, Xu 2009).
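The basic assign-and-update cycle of K-means described above can be sketched as follows; this is a plain Lloyd's-algorithm illustration on synthetic two-cluster data (the farthest-point seeding is a simple choice for the sketch, not part of the original algorithm):

```python
import numpy as np

def init_centers(X, k):
    # Greedy farthest-point seeding: a simple, deterministic start.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with Euclidean distance (illustrative sketch)."""
    centers = init_centers(X, k)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each subject to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned subjects
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated synthetic clusters (illustration only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
               rng.normal(5.0, 0.3, (30, 2))])
labels, centers = kmeans(X, k=2)
```

Swapping the Euclidean distance for a Mahalanobis-type metric, or the mean for a median, gives the variants mentioned in the text.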

(2) Hierarchical Clustering

Hierarchical clustering (Johnson 1967) builds a hierarchy and uses a dendrogram to represent the

hierarchical structure. Unlike K-means clustering, hierarchical clustering does not provide a


single partition of the dataset. It only shows the nested clusters organized as a hierarchical tree

and lets the user decide the clusters. In order to form the hierarchical tree, we must choose the

similarity metric between pairs of subjects and pairs of clusters. The similarity metric between two subjects is a distance. Different clusters will form by using different distance functions.

Commonly used distance functions include Euclidean distance, Manhattan distance,

Mahalanobis distance and maximum distance. A general discussion of distance functions can be

found in (Jain, Murty, Flynn 1999). Based on the distance function we choose, we can construct

a distance matrix for all subjects before we perform hierarchical clustering. Then we need to

select a linkage function, which is the similarity metric for pairs of clusters. Different linkage

functions will lead to different clusters. Commonly-used linkage functions include single

linkage, complete linkage and average linkage. A general discussion of linkage functions can be

found in (Hastie, Tibshirani, Friedman, Hastie, Friedman, Tibshirani 2009). An advantage of hierarchical clustering over K-means is that it does not stop at a specific number of clusters, but continues merging (or splitting) until the full hierarchy is built, from every object in its own cluster up to a single cluster containing all objects.

Therefore, the hierarchical tree may provide meaningful findings about the real structure of the dataset. However, it also has some drawbacks; for example, it may not be robust to outliers.
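The distance-matrix, linkage and tree-cutting steps above can be sketched with scipy on synthetic data (illustration only; the distance and linkage choices here are just one of the combinations the text discusses):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# two well-separated synthetic groups of subjects (illustration only)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, (15, 2)),
               rng.normal(4.0, 0.3, (15, 2))])

D = pdist(X, metric="euclidean")                 # pairwise subject distances
Z = linkage(D, method="average")                 # average-linkage hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```

The matrix `Z` encodes the full dendrogram; `fcluster` is where the user "decides the clusters" by cutting the tree at a chosen level or cluster count.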

Hierarchical clustering is often used together with a heat map to visualize the data matrix.

Heat maps use different colors to represent different values in the data matrix. The values in the

data matrix can be either the value of some variable or some statistic, e.g., correlation coefficient

or p-value. We can add hierarchical clustering trees on the side or top of the heat map so that we

can clearly see the structure of the data. There is a good example of this kind of representation

from (Poroyko, Morowitz, Bell, Ulanov, Wang, Donovan, Bao et al. 2011).


In this paper, the authors studied the effect of different diets on selecting different intestinal

microbial communities using a metabolomics approach. They used a heat map to show the

significant p-values associated with the relationship between metabolites and bacterial taxa in

piglet cecal content (see Supplementary Fig. 1). The dendrogram on the left side shows the

hierarchical structure of different genera of bacteria; the one on the top shows the hierarchical

structure of different metabolites. This graph helps us visualize the degree to which bacteria were

associated with the same or different metabolites. Another similar example of using hierarchical

clustering together with heat map representation can be found in (Draisma, Reijmers, Meulman,

van der Greef, Hankemeier, Boomsma 2013), which used hierarchical clustering to analyze

blood plasma lipid profiles of twins.

3. Self-Organizing Map (SOM)

SOM is a powerful tool to visualize high-dimensional data (Kohonen 1990); it can thus help us

visually discover the clusters in the data. It is an arrangement of nodes in a two-dimensional (sometimes one- or three-dimensional) grid. The nodes are vectors whose dimension is the same as the input

vectors. Since SOM is a type of artificial neural network (ANN), the nodes are also called

neurons. Unlike other types of ANNs, SOM uses a neighborhood function to connect adjacent

neurons. The neighborhood function is a monotonically decreasing function of the iteration number and of the distance between neighboring neurons and the neuron that best matches the input. It defines the region of influence that the input pattern has on the SOM, and the most common choice of the function is the Gaussian function. More technical details can be found in (Kohonen 1998). In this

way, data points located closely in the original data space will be mapped to neurons nearby.

Every node can thus be treated as an approximation of a local distribution of the original space,


and the resulting map retains its topological structure. Before implementing SOM, one may

choose the number of nodes and the shape of the grid, either hexagonal or rectangular. Using a

hexagonal grid implies that one node will have six bordering nodes. After numerous updating

cycles, each subject is finally assigned to a corresponding neuron, and neighboring neurons can

be treated as mini clusters. These mini clusters may give hints of metabolic patterns. In order to

see the clusters clearly, a unified distance matrix (U-matrix) representation can be constructed on

the top of SOM. The U-matrix nodes are located among all neighborhood neurons, and the color

codes of each node represent the average Euclidean distance among weight vectors of

neighboring neurons (Ultsch 2003). Therefore, the gaps between clusters can be shown by the

colors of U-matrix nodes.

There is an example of SOM with a U-matrix from metabolomics research (Haddad,

Hiller, Frimmersdorf, Benkert, Schomburg, Jahn 2009). In their paper, the SOM (see

Supplementary Fig. 2) was trained with the metabolome data from three different fermentations

of C. glutamicum. The color codes show different Euclidean distances between each output node

and its four bordering nodes. White – purple represents short distance, which implies the subjects

(black points) have similar metabolic patterns, while green – yellow represents long distance,

which implies the subjects have different metabolic patterns. We can see that the clusters were

clearly separated by the green – yellow gaps. SOMs have also been used to visualize metabolic

changes in breast cancer tissue (Beckonert, Monnerjahn, Bonk, Leibfritz 2003) and to improve

clustering of metabolic pathways (Milone, Stegmayer, López, Kamenetzky, Carrari 2014).
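The update rule described above (find the best matching unit, then pull it and its grid neighbors toward the input) can be sketched in numpy; the rectangular grid, Gaussian neighborhood, and linearly decaying learning rate and radius are simple illustrative choices, and dedicated SOM libraries offer tuned implementations:

```python
import numpy as np

def train_som(X, rows=5, cols=5, n_iter=500, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal rectangular-grid SOM sketch (illustration only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.normal(size=(rows, cols, p))   # node weight vectors
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                 # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3    # shrinking neighborhood radius
        x = X[rng.integers(n)]                  # a randomly chosen input vector
        # best matching unit (BMU): the node closest to x
        d = np.linalg.norm(W - x, axis=2)
        bmu = np.unravel_index(d.argmin(), d.shape)
        # Gaussian neighborhood around the BMU on the grid
        g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2)
                   / (2.0 * sigma ** 2))
        W += lr * g[:, :, None] * (x - W)       # pull nodes toward the input
    return W

# two synthetic metabolic "patterns" (illustration only)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.2, (40, 3)),
               rng.normal(3.0, 0.2, (40, 3))])
W = train_som(X)
```

After training, subjects from the two groups map to different regions of the grid, which is what the U-matrix coloring then makes visible.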

Supervised Learning Methods


The purpose of supervised learning is different from that of unsupervised learning. Supervised

learning methods are widely used in discovering biomarkers, classification, and prediction, while

unsupervised learning methods cannot complete these tasks. However, these distinctions do not

imply that supervised methods are superior to unsupervised methods; rather, each was designed

to achieve different objectives of analysis. Supervised learning deals with problems or datasets

that have response variables. These variables can be either discrete or continuous. When the

variables are discrete, e.g., control group vs. diseased group, the problems are called

classification problems. When the variables are continuous, e.g., metabolite concentration or

gene expression level, the problems are called regression problems. The purpose of supervised

learning is to determine the association between the response variable and the predictors (often

referred to as covariates) and to make accurate predictions. It is called supervised learning

because one or more response variables are used to guide the training of the models. Usually

both a training step and a testing step are included. Supervised learning algorithms are applied on

the training dataset to fit a model, and then the testing dataset is used to evaluate the predictive

power. In these steps, we may encounter the following problems: How to extract or select better

predictors? How to evaluate the fitness and predictive power of the model? And what learning

methods and algorithms to choose?

For the first problem, the process of choosing relevant predictors is called feature selection

or variable selection. There are three main types of feature selection methods: Wrapper, Filter

and Embedded (Guyon, Elisseeff 2003). The Wrapper method scores subsets of variables by

running every trained model on the test dataset and selecting the model (subset of variables) with

the best performance. The Filter method scores subsets of variables by easy-to-compute

measures before training the models. The Embedded method, just as its name implies, completes


feature selection and model construction at the same time. For the second problem, we first need

goodness of fit statistics to measure model fit and predictive power. Commonly used statistics

include but are not limited to: root mean square error (RMSE) for regression; sensitivity,

specificity and the area under the Receiver-Operating Characteristic (ROC) curve for binary

classification. In addition, we need test datasets to assess the predictive power and avoid over-

fitting issues. Ideally, model validation should be performed using independent test datasets;

however, gathering objective data can be expensive due to limited resources and other pragmatic

factors. Therefore, various resampling methods are often used in order to reuse the data

efficiently. These methods include cross validation, bootstrapping, jackknifing, randomization of

response variables and some others. Among all of them, bootstrapping and cross-validation are

used more often in validating supervised learning models (Hastie, Tibshirani, Friedman, Hastie,

Friedman, Tibshirani 2009). Commonly used cross-validation methods include k-fold validation

and random sub-sampling validation. Together with resampling methods, we can obtain a set of

goodness-of-fit statistics. By averaging them we can obtain a single statistic indicating the

fitness and predictive power of the model. For example, if the average of k RMSEs, which can

be the result of k-fold validation of model A, is lower than the average RMSEs of other models, then we can conclude that model A is the best one under the RMSE criterion. For the third

problem, there are many different supervised learning methods to choose from. Here we briefly

introduce two of the most widely used methods in metabolomics.
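The k-fold validation scheme described above can be sketched as follows, here averaging RMSE over folds for an ordinary least-squares model on synthetic data (the model and data are illustrative; the same loop applies to any fitted model):

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Average RMSE over k folds: fit on the training folds,
    score on the held-out fold, then average (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    rmses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr = np.column_stack([np.ones(len(train)), X[train]])  # add intercept
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        Xte = np.column_stack([np.ones(len(test)), X[test]])
        resid = y[test] - Xte @ beta
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(rmses))

# synthetic regression problem with noise std 0.1 (illustration only)
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
cv_rmse = kfold_rmse(X, y)
```

Comparing `cv_rmse` across candidate models implements the "model A is best under the RMSE criterion" comparison in the text.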

1. Partial Least Squares (PLS)

PLS (Wold 1966) is a method of solving linear models. A general linear model has the form Y = Xβ + ε, where Y is the response variable, which can be a vector (one variable) or a matrix (several variables); X is the design matrix whose columns represent variables and rows represent observations; β is the vector (matrix) of parameter coefficients and ε is the random error vector (matrix) (Martens 1992). Generally, we use the ordinary least squares estimate of β, which is (X^T X)^(-1) X^T Y. However, in metabolomics analyses, we always have a large number of

variables, such as metabolites, peak locations, and spectral bins, but a relatively small number of

observations. Moreover, these variables may be linearly dependent, and thus it will be impossible

to use the conventional least squares method to solve for β in the linear regression model, since it is impossible to invert the singular matrix X^T X. At first, Principal Component Regression

(PCR) was introduced to solve this problem. Instead of using all original variables, PCR uses the

first few PCs from PCA to fit the linear regression model. But it is not clear whether those PCs

have high correlation with response variables or not. Therefore, PLS was introduced to tackle

this problem (Wold, Ruhe, Wold, Dunn 1984). PLS may also stand for projection to latent

structures, which implies how this method works. The underlying model of the PLS method has

the form (Wold, Sjöström, Eriksson 2001): X = TP^T + E and Y = UQ^T + F. Similar to PCA, T and U are called the X and Y scores, which are matrices formed by latent variables; P and Q are called the X and Y loadings, which can be thought of as weight matrices; E and F are residuals, which are the remaining amounts that cannot be explained by latent variables. The latent variables, which can be thought of as factors, are linear combinations of the original X and Y variables, i.e., for each pair of latent variables t and u, t = Xw and u = Yc; w and c are called weight vectors. These latent variables may have chemical or biological meanings. The

PLS method finds a best set of latent variables that can explain most of the variation of Y. Namely,

we should find each pair of latent variables t and u such that, under some orthogonality conditions, their


covariance reaches its maximum value (Abdi 2010). There are many variants of the PLS and

corresponding algorithms, which may have different orthogonal conditions and different

methods to estimate scores and loading matrices. It is important to note that PLS is different

from PCA and PCR. First, PCA is an unsupervised learning method while PCR and PLS are

supervised learning methods. Second, PCR uses the first few PCs in the PCA as predictors to fit

a latent variable regression. Thus, PCA only explains the variance in X itself, while a PLS model tries to find the multi-dimensional direction in the X space that explains the maximum variance direction in the Y space. Therefore, the PLS method may often perform better than PCR.

The only parameter we need to specify in PLS is the number of components to keep. There

are two approaches. First, we can use plots to help us decide the number of components, e.g., the T and U scores plots or the R-square plot. Another approach is using resampling methods together with a

measure of goodness of fit or predictive power. We can select different numbers of components

and check their goodness of fit or predictive power. Since the PLS method is a dimension

reduction method itself, feature selection is not a required step in PLS. However, in order to

improve interpretation, robustness and precision, there are also some feature selection methods

that can be used with PLS. For example, we can use a two-sample t-test, a filter method, to select

variables before running PLS. Sparse PLS (SPLS), which is an embedded method, imposes

sparsity when constructing the direction vectors, thereby improving interpretation and achieving good prediction performance simultaneously (Chun, Keleş 2010). Another method called

orthogonal projections to latent structures (OPLS) (Trygg, Wold 2002), can be embedded as an

integrated part of PLS modeling to remove systematic variation in X that is orthogonal to Y, thus

also enhancing the interpretation of PLS.
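The component-by-component construction of latent variables described above can be sketched with a minimal NIPALS-style PLS1 (single response) in numpy; this is an illustrative sketch on synthetic collinear data, and a tested implementation is available as `PLSRegression` in scikit-learn:

```python
import numpy as np

def pls1(X, y, n_comp=2):
    """NIPALS-style PLS for a single response (illustrative sketch).

    Each latent variable t = X w is chosen to have high covariance with
    the response; X and y are deflated after each component.
    """
    X = X - X.mean(axis=0)
    y = y - y.mean()
    Xk, yk = X.copy(), y.copy()
    Ws, Ps, qs = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)       # weight vector
        t = Xk @ w                   # X score (latent variable)
        p = Xk.T @ t / (t @ t)       # X loading
        q = (yk @ t) / (t @ t)       # y loading
        Xk = Xk - np.outer(t, p)     # deflate X
        yk = yk - q * t              # deflate y
        Ws.append(w); Ps.append(p); qs.append(q)
    W, P, q = np.array(Ws).T, np.array(Ps).T, np.array(qs)
    # regression coefficients in terms of the centered X
    return W @ np.linalg.inv(P.T @ W) @ q

# synthetic data with deliberately collinear columns (illustration only)
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 10))
X[:, 5:] = X[:, :5] + rng.normal(scale=0.01, size=(60, 5))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=60)
B = pls1(X, y, n_comp=4)
pred = (X - X.mean(axis=0)) @ B + y.mean()
```

Note that ordinary least squares would struggle here because X^T X is nearly singular, which is exactly the situation, described in the text, that PLS was designed for.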


Although PLS was first designed to deal with regression problems, it can also be used in

classification problems. One popular method is called PLS-Discriminant Analysis (DA)

(Boulesteix 2004; Nguyen, Rocke 2002). In PLS-DA, Y is a vector whose values represent class

membership. When considering model validation in PLS or PLS-DA, R², Q² and the Predicted Residual Sum of Squares (PRESS) can be used in addition to the commonly used diagnostic methods mentioned above. Note that R² is a measure of fitness of the model to the training data set, while Q² and PRESS are used to evaluate the predictive power of the model. For the PLS-DA method,

it is recommended to use a double cross-validation procedure (Szymanska, Saccenti, Smilde,

Westerhuis 2012) along with the number of misclassifications and the area under the ROC curve

as diagnostic statistics. Using similar algorithms as PLS-DA, other variants of PLS, like SPLS

and OPLS mentioned above, can also be extended to classification problems, where they are called

SPLS-DA (Chung, Keles 2010) and OPLS-DA (Bylesjö, Rantalainen, Cloarec, Nicholson,

Holmes, Trygg 2006).

Here we use an example to illustrate some application aspects of the PLS model (Kang,

Park, Shin, Lee, Oh, Ryu do, Hwang et al. 2011) (Fig. 2). The authors used OPLS-DA to classify

coronary heart failure (CHF) groups and control groups. Fig. 2a is a scores plot of the first two

components, which shows the similarities and dissimilarities of the subjects. In this plot, we can

see that the diseased and control groups can be clearly separated by the OPLS-DA model. Fig.

2b is the corresponding loadings plot. Different from a PCA loadings plot, the metabolites identified here are directly responsible for the classification. The upper section of Fig. 2b shows that

metabolites increased in the control group while the lower section shows that metabolites

increased in the heart failure group. They ran the OPLS-DA model with NMR spectral data (which forms the input data matrix X) and then identified the metabolites responsible for the separation using the results in Fig. 2b.

The PLS method has been successfully applied in numerous metabolomics studies for disease

classification and biomarker identification (Marzetti, Landi, Marini, Cesari, Buford, Manini,

Onder et al. 2014; Velagapudi, Hezaveh, Reigstad, Gopalacharyulu, Yetukuri, Islam, Felin et al.

2010; Zhang, Gowda, Asiago, Shanaiah, Barbas, Raftery 2008). Note that the PLS method

sometimes can also be used as a dimension reduction (feature selection) tool rather than as a

classification method (Bu, Li, Zeng, Yang, Yang 2007).

2. Support Vector Machine (SVM)

Since metabolomics data is represented in matrix form, every subject is a row vector; thus, each

subject can be viewed as a point in a p-dimensional space where p is the number of variables. If

we can separate the data into two groups, intuitively we can find a “gap” between these two

groups in the p-dimensional space. SVM tries to find such a gap that is as wide as possible

(Cortes, Vapnik 1995). The margins for the gap are defined by support vectors, i.e., the points

located on the margin. SVM is trained to determine the support vectors. The boundary in the

middle of the gap that separates the data is called the separating hyperplane. The prediction is done by deciding to which side of the hyperplane new subjects (observations) belong.

The original SVM algorithm is a linear classifier, which means it can only produce a hyperplane (a (p−1)-dimensional plane in p-dimensional space) to classify the data. We aim to find the

largest margin, i.e., the largest distance between two groups, which can be solved by quadratic

programming. The related mathematical expression of this problem has been documented by

Bishop (Bishop 2006). However, it is quite common that the data cannot be linearly separated,


i.e., a separating hyperplane does not exist. In this case, we can use the kernel trick to map the

original data to a higher dimensional space so that it can be linearly separated in that space.

The kernel trick, or kernel substitution, is very useful in extending algorithms. It substitutes the inner

product (linear kernel) with other kernels. Commonly used kernels include the polynomial kernel

and the Gaussian kernel (Bishop 2006).
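The effect of the kernel trick can be illustrated with scikit-learn's `SVC` on synthetic data that is not linearly separable (an inner blob surrounded by a ring; the data and parameter values are illustrative choices, not from the original studies):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data: an inner blob surrounded by a ring,
# so no single hyperplane can separate the classes in 2D.
rng = np.random.default_rng(6)
inner = rng.normal(scale=0.3, size=(50, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, 50)
outer = np.column_stack([2.0 * np.cos(angles), 2.0 * np.sin(angles)])
outer += rng.normal(scale=0.1, size=(50, 2))
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# A linear kernel cannot separate these classes; the Gaussian (RBF)
# kernel implicitly maps them to a space where a hyperplane exists.
linear_acc = SVC(kernel="linear", C=1.0).fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y).score(X, y)
```

The parameter `C` controls the soft margin discussed below: smaller values tolerate more misclassified training subjects in exchange for a wider margin.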

Another problem is the stability of the algorithm. If there are outliers or mislabeled data,

the original SVM may give an unsatisfactory classification result. In this case, we can use SVM

with a soft margin to solve this problem. The soft margin SVM allows some misclassification in

the training step by adding slack variables to the original objective function. This modification

changes our objective from strictly maximizing the margin between two groups to separating the groups as cleanly as possible while keeping the margin wide. Here “clean” means only a few misclassified subjects.

Since SVM is a well-regularized method, it does not always require a feature selection step.

However, there are some feature selection methods that can be used to enhance the performance

of SVM and lower its computational cost. Examples include the recursive feature elimination

(RFE) method and L1 norm SVM (Guan, Zhou, Hampton, Benigno, Walker, Gray, McDonald et

al. 2009). As discussed in the PLS section, similar validation methods and diagnostic measures

can be applied to the SVM algorithm. Moreover, these validation methods and diagnostic

measures can help us select optimum parameters such as which kernel to choose and its

parameters.

Compared to the PLS-DA method, one minor disadvantage of SVM is that it is often

difficult to visualize and interpret the classification result using plots; especially when the

number of variables is large. However, for classification purposes, we still recommend SVM

over most other methods (Mahadevan, Shah, Marrie, Slupsky 2008), and SVM has been widely


used for classification and prediction in metabolomics research, especially in cancer research

(Guan, Zhou, Hampton, Benigno, Walker, Gray, McDonald et al. 2009; Henneges, Bullinger,

Fux, Friese, Seeger, Neubauer, Laufer et al. 2009; Stretch, Eastman, Mandal, Eisner, Wishart,

Mourtzakis, Prado et al. 2012). Moreover, SVM can also be used in regression problems, where

it is called support vector regression (SVR) (Brereton, Lloyd 2010). Detailed discussions on

SVR and its applications in metabolomics and chemometrics can be found in (Li, Liang, Xu

2009).

Pathway Analysis Methods

Pathway analysis allows us to detect the biological mechanisms in which identified metabolites

are involved. Some metabolic pathway analysis methods are directly borrowed from gene

pathway analysis, e.g., over-representation analysis (ORA) and enrichment score. Here we

provide a brief introduction to ORA, Functional Class Scoring (FCS) and some pathway

simulation methods.

1. Over-Representation Analysis (ORA)

In practice, there may be cases where we have a list of identified metabolites and we

only want to know which pathways are involved in the samples being studied. There are many

metabolic pathway databases available on the internet. In this case the pathways are already

specified and we only need to test which pathway is significantly involved based on the available

samples. This kind of pathway analysis is called knowledgebase-driven pathway analysis

(Khatri, Sirota, Butte 2012). Among all of the knowledgebase-driven pathway methods, ORA is

the best known and the simplest. ORA is used to test whether pathways are significantly different

between two study groups. Before performing ORA, we should have a list of metabolites

showing significant differences between two groups. This can be done by using two sample tests,

e.g., t-tests or nonparametric tests, for all metabolites. Then we select metabolites whose

significance reaches a predetermined threshold for false discovery rates (FDRs) or p-values.

Next, we can perform ORA, which is equivalent to a 2×2 contingency table (Table 1) test in

statistics. After we obtain all related pathways from the knowledgebase, we can count, for each known pathway, the number of metabolites that fall inside or outside both the pathway and the significant list, and then perform a statistical test of whether that pathway is significantly involved. The most frequently

used tests include the chi-square test, which requires larger sample sizes, and Fisher’s exact test,

which is more appropriate for smaller cell counts in the table and uses a hypergeometric

distribution (Agresti 2014).
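The 2×2 contingency-table test above can be sketched with scipy's Fisher's exact test; the counts here are hypothetical numbers invented for illustration, not from any cited study:

```python
from scipy.stats import fisher_exact

# Hypothetical counts for one pathway of 20 metabolites out of 500
# measured, with 28 metabolites on the significant list overall:
#                  in the list   not in the list
# in pathway            8               12
# not in pathway       20              460
table = [[8, 12],
         [20, 460]]

# One-sided Fisher's exact test: is the pathway over-represented
# among the significant metabolites?
odds_ratio, p_value = fisher_exact(table, alternative="greater")
```

Repeating this test for every pathway in the knowledgebase, followed by a multiple-testing correction, is the complete ORA procedure.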

2. Functional Class Scoring (FCS)

ORA is simple to perform, but it has several drawbacks. First, much information is lost since

only the most significant metabolites are used and the rest are ignored, and only the number of

identified metabolites is considered. Second, the optimal threshold is unclear. Third, it makes improper independence assumptions. For example, it assumes that each metabolite and pathway is independent of the others, but in reality this assumption may not be valid. Therefore, another class

of methods called Functional Class Scoring (FCS) was proposed to address some of the

limitations in ORA. A general framework of univariate FCS methods works as follows: First,

obtain single-metabolite statistics (e.g., t-statistic and z-statistic) by computing differential

expression of individual metabolites. Second, aggregate those single-metabolite statistics to


compute a pathway level statistic. This pathway level statistic can be univariate or multivariate.

Commonly used univariate pathway level statistics include mean, median and enrichment score

(Holmans 2010). For multivariate statistics, a widely used statistic is Hotelling’s T² statistic,

which has an F distribution under the null hypothesis (Johnson, Wichern 2007). The final step is

hypothesis testing. There are two kinds of null hypothesis: competitive and self-contained. A

competitive test considers metabolites both within and outside of the pathway, while a self-

contained test ignores metabolites that are not in the pathway. In other words, for a competitive

test, the null hypothesis being tested is that the association between the specific pathway and

disease is average; while for a self-contained test, the null hypothesis is that there is no

association between the specific pathway and disease (Holmans 2010). For the multivariate

statistics, the null hypothesis is self-contained since the null hypothesis is that there is no

association between metabolites in the pathway and the phenotype (Holmans 2010). Although

multivariate statistics take the correlation of different metabolites into account, they are not necessarily more powerful than univariate statistics (Khatri, Sirota, Butte 2012). There are also

some drawbacks of the FCS method. If two pathways have the same metabolites, FCS will give

the same result. That is, FCS does not take the reactions among metabolites (topology structure)

into account. One way to address this problem is to use a correlation measure, such as the

Pearson correlation coefficient, to help us choose the most suitable pathway; another method is

to use pathway reconstruction.

3. Metabolic pathway reconstruction and simulation

Metabolic pathway/network reconstruction and simulation are a family of methods used to refine

or construct metabolic networks. A reconstruction collects all of the relevant metabolic


information of an organism and compiles it in a mathematical model. The relevant metabolic

information includes all related known chemical reactions, previously constructed networks,

experimental data, and related research results. After compiling the data into a model, we can

obtain the output of the system and then use it to refine our model and perform the simulation

iteratively. If we have the knowledge of all involved metabolites, then we can enhance the

predictive capacity of the reconstructed models by connecting the metabolites within the

pathways. The pathway models can be roughly classified into one of two categories: static

(stoichiometric network models) and kinetic models. We will first discuss static models and then

give a brief introduction on kinetic modeling.

The mathematical model behind static models is a linear system. If we treat the metabolic network as a system then, based on mass conservation of internal metabolites within the system, we can express the reaction network by a stoichiometric matrix S. Each element S_ij of S represents the stoichiometric coefficient of metabolite i in reaction j. At a steady state, we have S v = 0, since there is no accumulation of internal metabolites in the system (Schilling, Schuster, Palsson, Heinrich 1999). This linear system is called the flux-balance equation. Here v represents the fluxes through the associated reactions in S. These linear equations define the entire reaction network; all of the solutions to this linear system are valid steady-state flux distributions. In general, S is an m x n matrix where the number of columns is larger than the number of rows (n > m), which means there are more reactions than metabolites (Schilling, Schuster, Palsson, Heinrich 1999). Moreover, S is a full-rank matrix, which means rank(S) = m < n. Therefore, there are multiple solutions to this linear system. Different solutions define different pathways. Based on different research purposes, we may impose different constraints on the linear system and obtain different types of solutions.
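As a minimal numerical sketch of the flux-balance equation, consider a hypothetical two-metabolite, three-reaction chain; the steady-state flux distributions are exactly the null-space vectors of the stoichiometric matrix:

```python
import numpy as np
from scipy.linalg import null_space

# Toy network (hypothetical): metabolites A, B and reactions
# R1: -> A,  R2: A -> B,  R3: B ->
# Rows = metabolites, columns = reactions.
S = np.array([[ 1.0, -1.0,  0.0],   # A: produced by R1, consumed by R2
              [ 0.0,  1.0, -1.0]])  # B: produced by R2, consumed by R3

# Steady-state fluxes satisfy S @ v = 0, so valid flux distributions
# live in the null space of S; its dimension is n - rank(S) = 3 - 2 = 1.
N = null_space(S)
print(N.shape)                 # -> (3, 1): a single basis vector
v = N[:, 0]
print(np.allclose(S @ v, 0))   # -> True: every null-space vector balances
```

For this chain the null space is spanned by a vector proportional to (1, 1, 1): at steady state, all three fluxes must be equal.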


Here we introduce three kinds of solutions that are the most widely used in metabolic

pathway analysis. The first type of solution is called elementary modes. In addition to the flux-

balance equation, we add the constraint that all irreversible fluxes are greater than or equal to 0. Then, by applying the

convex analysis method, we can find a solution set called elementary modes (EM) if the

following properties are satisfied:

(i) Uniqueness: The solution set is unique for a given network.

(ii) Non-decomposability: Each solution in the solution set consists of the minimum number of

reactions that it needs to exist as a functional unit. If any reaction in a solution were removed, that solution could no longer operate as a functional unit.

(iii) The solution set is the set of all routes through a metabolic network consistent with the

second property (Papin, Stelling, Price, Klamt, Schuster, Palsson 2004).

The second type of solution is called extreme pathways (EP). By convex analysis, the

solution set is called the extreme pathways if it is under the same constraints of EM and follows

properties below:

(i) Uniqueness: The solution set is unique for a given network.

(ii) Non-decomposability: Each solution in the solution set consists of the minimum number of

reactions that it needs to exist as a functional unit.

(iii) This solution set is the systemically independent subset of the elementary modes; that is, no

solution in the set can be represented as a nonnegative linear combination of any other solutions

in the solution set, namely, they are convex basis vectors (Papin, Stelling, Price, Klamt, Schuster,

Palsson 2004).

By using these two kinds of solutions, we can analyze or construct metabolic pathways and

networks. Note that we may have a finite number of solutions for EM and EP, which means that


we may obtain several different pathways. A numerical example of the calculation and use of the

EM and EP can be found in (Förster, Gombert, Nielsen 2002). The key difference between EM

and EP is that they treat internal reversible and irreversible reactions differently. EP analysis

decouples all internal reversible reactions into forward and reverse directions while EM analysis

accounts for reaction directionality through a series of rules in the corresponding calculations of

the modes. Moreover, the EPs are a subset of the EMs; i.e., the number of extreme pathways is smaller than (potentially much smaller than) or equal to the number of elementary modes.

Another important solution corresponds to an independent analysis method called Flux

Balance Analysis (FBA). FBA differs from EM or EP by imposing more constraints and an

objective function. Depending on the purpose of the research being performed, we have different

problems of interest pertaining to the pathway. For example, we may want to maximize or

minimize the flux of certain reactions; or want to limit some flux to a certain interval and see

how the pathway changes. Therefore we need to impose an objective function on the linear

system. The problem is thus transformed into an optimization problem. We can use linear

programming to solve this problem; it does not matter how many linear constraints we have to

impose on the flux vector. Note that FBA generally gives only one solution. This is in contrast

to EM and EP, which give several solutions. Fig. 3 shows how to perform an FBA (Raman,

Chandra 2009). First, we specify the reaction network that contains all metabolites and detailed

information on all possible reactions (Fig. 3, 1st step). Internal fluxes are denoted by v_1, ..., v_n and exchange fluxes are denoted by b_1, ..., b_m. Then, after building the linear system

based on the network structure (Fig. 3, 2nd and 3rd steps), we can add a biologically relevant

objective function and relevant constraints (Fig. 3, 4th and 5th steps). The remainder of the task is


linear programming (Fig. 3, last step), which can be accomplished using software packages, such

as the COBRA Toolbox for MATLAB (Becker, Feist, Mo, Hannum, Palsson, Herrgard 2007).
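FBA can also be sketched with a general-purpose linear-programming solver; the toy network, objective, and bounds below are our own illustration (not the example of Fig. 3), using SciPy's linprog in place of a dedicated tool such as the COBRA Toolbox:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical network: maximize flux through R3 (a biomass-like output)
# subject to the flux-balance equation S v = 0 and capacity bounds.
S = np.array([[ 1.0, -1.0, -1.0,  0.0],   # A: made by R1, used by R2 and R3
              [ 0.0,  1.0,  0.0, -1.0]])  # B: made by R2, exported by R4
c = np.array([0.0, 0.0, -1.0, 0.0])       # linprog minimizes, so negate v3
bounds = [(0, 10)] * 4                    # irreversible fluxes, capped at 10

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
print(res.x)     # an optimal steady-state flux distribution
print(-res.fun)  # maximal flux through R3 (here 10, limited by R1's cap)
```

The equality constraints encode S v = 0, and the bounds encode irreversibility and capacity; changing the objective vector c asks a different biological question of the same network.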

The pathways constructed using the methods above cannot show dynamic behavior, such as regulatory effects, how the enzymes work, or whether the pathway is in a stable steady state. Therefore, we may use a kinetic model to simulate the metabolic network (Tomar, De 2013).

A kinetic reaction network model can be described by ordinary or partial differential equations

(ODE or PDE, respectively). For example, we can simulate the network based on the following

simple ODE (Steuer 2007).

dx/dt = S v(x, k);

x = x(t) is the time-dependent concentration vector of the m internal metabolites, k represents the (e.g., Michaelis-Menten) kinetic parameters, and S is the stoichiometric matrix. v(x, k) is a vector of enzyme-kinetic rate equations that consists of nonlinear functions of x and k. We can see that if we set the left-hand side equal to zero (indicating a steady state) and treat v as a flux vector, then the equation is exactly the flux-balance equation. Given an initial condition x(0), the values of the kinetic parameters k and the rate equations v, we can simulate the data through the ODE given above. A common choice for the rate equations (for each reaction in v) is the Michaelis-Menten form,

v_j = V_max x / (K_M + x),

where V_max is the maximal reaction velocity and K_M is the Michaelis constant. However, sometimes we do not know the explicit form of the rate equations or it is difficult to estimate the kinetic parameters k. In these cases, we can use the structural kinetic modeling (SKM) method (Steuer 2007). The SKM method uses the Jacobian matrix as a local linear approximation of the rate equations. The Jacobian matrix consists of all first-order partial derivatives of the rate equations, and it can be rewritten and estimated using the SKM method (Wiechert 2002; Steuer 2007).
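The kind of kinetic simulation described above can be sketched as follows; the two-metabolite chain, its constant input flux, and all parameter values are hypothetical:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Chain: ->(v0) x1 ->(v1) x2 ->(v2), with dx/dt = S @ v(x)
S = np.array([[ 1.0, -1.0,  0.0],
              [ 0.0,  1.0, -1.0]])

def rates(x):
    v0 = 1.0                           # constant input flux
    v1 = 2.0 * x[0] / (0.5 + x[0])     # Michaelis-Menten: Vmax = 2.0, Km = 0.5
    v2 = 1.5 * x[1] / (0.5 + x[1])     # Michaelis-Menten: Vmax = 1.5, Km = 0.5
    return np.array([v0, v1, v2])

def dxdt(t, x):
    return S @ rates(x)

sol = solve_ivp(dxdt, (0.0, 50.0), [0.1, 0.1], rtol=1e-8, atol=1e-10)
x_final = sol.y[:, -1]
print(x_final)               # concentrations approach the steady state [0.5, 1.0]
print(S @ rates(x_final))    # near zero: the flux-balance condition S v = 0
```

Setting dx/dt = 0 and solving each Michaelis-Menten rate for the input flux recovers the steady state analytically (x1 = 0.5, x2 = 1.0 here), which the numerical trajectory converges to.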


Analysis methods for time course data

Variables, for example the concentration of metabolites, may change with time, thereby creating

a time dimension in the dataset. Unsupervised learning and data visualization tools are still

initially useful for giving us a general idea of the data structure. We can also use visualization

tools such as PCA, SOM, and heat maps with a hierarchical clustering structure to detect patterns

and groups/clusters of the data. The only difference is that we should include a time dimension.

In addition, by drawing profile graphs we can check the profiles of metabolites or subjects for

different clusters. However, if we want to compare the temporal profiles (similar or different

patterns of change) of metabolites between different subjects or groups of subjects, we need to

introduce statistical methods different from those described above. Among different statistical

methods that can be used to analyze time course data (Smilde, Westerhuis, Hoefsloot, Bijlsma,

Rubingh, Vis, Jellema et al. 2010), we only introduce analysis of variance (ANOVA) based

methods.

If we are analyzing one variable over time, e.g., the expression level of a protein or

concentration of a metabolite, and we want to test whether the temporal profiles of this variable

are significantly different under different experimental conditions, a natural choice is to use two-

way ANOVA, which is often used when studying the effects of different treatments in chemical

or biological experiments. Here we show the basics of this ANOVA model. In metabolomics

research, the experimental condition (α) and time effect (β) can be treated as two fixed effects.

The general linear model for a two-way ANOVA is:

y_ijk = mu + alpha_i + beta_j + (alpha beta)_ij + e_ijk,

where y_ijk refers to the measurement obtained from the k-th subject at the j-th time point under the i-th condition (i = 1, ..., a; j = 1, ..., b; k = 1, ..., n); alpha_i and beta_j are fixed effects corresponding to condition and time, and (alpha beta)_ij is their interaction; a and b are the total numbers of levels for each effect; n is the number of replicates (often corresponding to subjects) for each combination of condition and time effects; mu is the overall (grand) mean. In many applications involving the traditional two-way ANOVA, the e_ijk are assumed to be independent random errors following a normal distribution, e_ijk ~ N(0, sigma^2) (Kutner 2005).

Although it seems reasonable to use a two-way ANOVA to analyze time course data, the

data may be better described by a repeated measures (RM) model. The main difference between the two-way ANOVA and the RM model is that the RM model has a subject error term (pi_k(i)), which takes variation within the group or subject into consideration. The RM model is:

y_ijk = mu + alpha_i + pi_k(i) + beta_j + (alpha beta)_ij + e_ijk,

where pi_k(i) (k = 1, ..., n) are subject effects (within-group error) which follow a normal distribution, pi_k(i) ~ N(0, sigma_pi^2), and are independent of the random errors e_ijk. Other

notations are the same as the two-way ANOVA model mentioned above. In metabolomics

research, there are often some differences between subjects even within the same groups. If the

subjects vary a great deal within each group, then the RM model will be more powerful than the

simple two-way ANOVA analysis (Milliken, Johnson 2009). Since each subject is repeatedly

assessed in most time course studies, we strongly recommend using the RM model instead of

two-way ANOVA. With the RM model, the user can specify a variety of correlation structures

for the measurement error. Compound-symmetric correlation is often assumed in repeated

measures analyses. This covariance structure for the e_ijk allows for within-subject correlation that is common over time. If the correlation is expected to decay over time, it is advantageous to

consider autoregressive covariance structures (Brockwell, Davis 2002) or exponential covariance

functions (Szczesniak, McPhail, Duan, Macaluso, Amin, Clancy 2013).
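The two covariance structures mentioned can be written down directly; the following sketch builds compound-symmetric and AR(1) matrices for b repeated measures (the function names, sigma^2, and rho values are ours):

```python
import numpy as np

def compound_symmetric(b, sigma2, rho):
    """Compound-symmetric covariance: constant correlation rho at all lags."""
    R = np.full((b, b), rho)
    np.fill_diagonal(R, 1.0)
    return sigma2 * R

def ar1(b, sigma2, rho):
    """AR(1) covariance: correlation rho**|j-k| decays with the time lag."""
    idx = np.arange(b)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

print(compound_symmetric(4, 1.0, 0.5))
print(ar1(4, 1.0, 0.5))
```

Comparing the two matrices for the same rho shows the practical difference: under compound symmetry, measurements 1 and 4 are as correlated as measurements 1 and 2, while under AR(1) the correlation falls off geometrically.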

The inference procedure for the RM model is quite similar to ANOVA. We can calculate the F statistics and p-values by decomposing the total sum of squares (SST) into different parts as shown in the following formula:

SST = SS(alpha) + SS(beta) + SS(alpha beta) + SS(pi) + SSE

After validating all assumptions (normality, independence and homogeneity of the variances),

we turn to look at the ANOVA table (Kutner 2005; Milliken, Johnson 2009). The F statistics

each follow an F distribution under the null hypothesis (corresponding effects are all zero) with

corresponding numerator degrees of freedom. Therefore, we have three effects to test: two main effects (alpha and beta) and an interaction effect. Note that we must always test the interaction between alpha and beta first, since a significant interaction may mask the significance of main effects and influence the interpretation of the data. The null hypothesis for the interaction effect is (alpha beta)_ij = 0 for all i and j. If the interaction effect is significant, it implies that the temporal profiles for different

groups (experimental conditions) are different. Usually, the estimated treatment means plots (and

many other plots) will give us a straightforward explanation of main and interaction effects

(Milliken, Johnson 2009).
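For a balanced design, the sum-of-squares decomposition and F tests above can be computed directly; the following is a sketch with simulated data (the function name, design sizes, and effect values are ours, not from the references):

```python
import numpy as np
from scipy import stats

def rm_anova(y):
    """Sum-of-squares decomposition for a balanced one-between (group),
    one-within (time) repeated measures design. y has shape (a, n, b):
    a groups, n subjects per group, b time points."""
    a, n, b = y.shape
    grand = y.mean()
    m_group = y.mean(axis=(1, 2))          # group means, shape (a,)
    m_time = y.mean(axis=(0, 1))           # time means, shape (b,)
    m_subj = y.mean(axis=2)                # subject means, shape (a, n)
    m_gt = y.mean(axis=1)                  # group x time means, shape (a, b)

    ss_group = n * b * np.sum((m_group - grand) ** 2)
    ss_subj = b * np.sum((m_subj - m_group[:, None]) ** 2)  # subjects within groups
    ss_time = a * n * np.sum((m_time - grand) ** 2)
    ss_inter = n * np.sum((m_gt - m_group[:, None] - m_time[None, :] + grand) ** 2)
    ss_err = np.sum((y - grand) ** 2) - ss_group - ss_subj - ss_time - ss_inter

    # F tests: group against the subject error; time and interaction against SSE
    df_err = a * (n - 1) * (b - 1)
    f_group = (ss_group / (a - 1)) / (ss_subj / (a * (n - 1)))
    f_time = (ss_time / (b - 1)) / (ss_err / df_err)
    f_inter = (ss_inter / ((a - 1) * (b - 1))) / (ss_err / df_err)
    p_inter = stats.f.sf(f_inter, (a - 1) * (b - 1), df_err)
    return {"SS": (ss_group, ss_subj, ss_time, ss_inter, ss_err),
            "F": (f_group, f_time, f_inter), "p_interaction": p_inter}

rng = np.random.default_rng(1)
a, n, b = 2, 6, 4
subj = rng.normal(0, 0.5, size=(a, n, 1))     # subject random effects
time_eff = np.linspace(0, 1, b)               # a common time trend
y = 1.0 + time_eff + subj + rng.normal(0, 0.3, size=(a, n, b))
out = rm_anova(y)
print(out["F"], out["p_interaction"])
```

Note how the between-group F uses the subject-within-group mean square as its denominator, which is exactly what distinguishes the RM analysis from the plain two-way ANOVA.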

If the response variable Y is a matrix, then we will need to analyze multiple metabolites

simultaneously, while taking their correlation structure into consideration. In this case, we need

another method called ANOVA-simultaneous component analysis (ASCA). ASCA is a

generalization of ANOVA from the univariate case to the multivariate case. Statistically, the traditional multivariate generalization of ANOVA is MANOVA; however, MANOVA will not work if the


covariance matrices are singular or the assumption of multivariate normality is violated. The idea

behind ASCA comes from principal component analysis, which decomposes the original data

matrix into a component score and a loading matrix plus an error term. The following is the full

model (Smilde, Jansen, Hoefsloot, Lamers, van der Greef, Timmerman 2005):

X_hi = 1_K m' + T_1 P_1' + T_2,h P_2' + T_3,hi P_3' + E_hi,

with the constraints that the effect terms sum to zero over their levels: sum_h n_h T_2,h = 0, and sum_i T_3,hi = 0 for each h. H denotes the number of groups and n_h denotes the number of replicates in group h. X_hi is a K x J matrix, where K denotes the number of available time points and J denotes the number of variables. 1_K is a K-dimensional vector of ones and m is a J-dimensional vector of overall means of the variables, so each row in the matrix 1_K m' represents the overall mean of the variables. T_1 P_1' is the matrix of the "time" effect; T_2,h P_2' represents the "time" and treatment interaction effect; T_3,hi P_3' is the interaction of treatment, time, and subject; E_hi is the matrix of residuals. The corresponding P matrices are loading matrices. Unlike in ANOVA, we will use the score plots of the first few components for each effect (the T P' matrices, a.k.a. sub-models) to detect the time main effect or interaction effect, and we can use the corresponding loading plots to detect which variables are responsible for the variation. Some examples of ASCA score plots can be found in (Nueda, Conesa, Westerhuis, Hoefsloot, Smilde, Talon, Ferrer 2007).

In their paper, they mainly discussed the application of ASCA on time course microarray

data. As an example, in the score plots of their simulation study (see Supplementary Fig. 3), sub-model a represents the time main effect and sub-model b represents the treatment effect

(treatment main effect and its interaction with time effect). The percentages on the left show how

much the components in the sub-model (principal components kept in this matrix) explain the

variation in the corresponding effect (sub-model). The first plot shows that a time main effect exists; the following two show differences between subjects under the same treatment. Note

that in this example, the time-treatment effect was not modeled independently, which is different

from that in the original paper of ASCA (Smilde, Jansen, Hoefsloot, Lamers, van der Greef,

Timmerman 2005).

There are many other methods for analyzing time course data that we did not discuss in this

paper, such as time-series analysis with ARMA models (Smilde, Westerhuis, Hoefsloot,

Bijlsma, Rubingh, Vis, Jellema et al. 2010). The ARMA model makes sense only when we have

many more than just two or three time points. If our dataset has many time points and we wish to

find and compare the profile curves, then a functional data analysis (FDA) based method (Berk, Ebbels, Montana 2011) may be a good choice. That paper proposed a smoothing-splines mixed-effects model that

treats each longitudinal measurement as a smooth function of time and uses a functional t-type

test statistic to quantify the difference between two sets of curves. See (VanDyke, Ren,

Sucharew, Miodovnik, Rosenn, Khoury 2012) for a biomedical application related to this

approach. Furthermore, since metabolomics is well suited to longitudinal studies, if we have

many time points and experimental conditions, we can use the Hierarchical Linear Model to fit

the data. It treats the time profile as a function and the parameters of the time profile function as

random variables (Jansen, Hoefsloot, Boelens, van der Greef, Smilde 2004).


CONCLUSIONS

Metabolomics is a rapidly growing field that has greatly improved our understanding of the

metabolic mechanisms behind biological processes as well as human diseases. Its broad goal is

to understand how the overall metabolism of an organism changes under different

conditions. Metabolomics has been used to study diseases such as cystic fibrosis, central nervous

system diseases, cancer, diabetes, and cardiac disease. Using metabolomics could lead to the

discovery of more accurate biomarkers that will help diagnose, prevent, and monitor the risk of

disease. This review briefly introduced the background of metabolomics, NMR and MS

strategies, and data pre-processing. We then placed our main focus on the data analysis of

metabolomics and described mainstream data analysis methods in current metabolomics

research. These include unsupervised learning methods, supervised learning methods, pathway

analysis methods and time course data analysis. Finally, in Table 2, we summarized the key

points of the methods discussed, as well as some basic methods such as fold change and two

sample t-test that were not included in this review. We hope our review will be a useful reference for researchers without a background in this type of data analysis.


Acknowledgement

We would like to express our great appreciation to Dr. Lilliam Ambroggio and Dr. Lindsey

Romick-Rosendale for their valuable and constructive suggestions to our review. Their

willingness to give their time so generously has been very much appreciated. This study was funded by NIH grant R01 HL116226 to RDS and LJL.


Permissions

For the figures cited in the manuscript and supplementary materials, we have obtained

permissions from the copyright owners for both the print and online formats. We can send those

permissions to Metabolomics when needed.


Conflict of interest statement

Sheng Ren, Anna A. Hinzman, Emily L. Kang, Rhonda D. Szczesniak and L. Jason Lu declare that they have no conflicts of interest and have included separately signed conflict-of-interest forms with this manuscript.


REFERENCES

Abdi, H. (2010). Partial least squares regression and projection on latent structure regression

(PLS Regression). Wiley Interdisciplinary Reviews: Computational Statistics 2, 97-106

Agresti, A. (2014). Categorical data analysis, John Wiley & Sons.

Anderson, P. E., N. V. Reo, N. J. DelRaso, T. E. Doom, M. L. Raymer (2008). Gaussian binning:

a new kernel-based method for processing NMR spectroscopic data for metabolomics.

Metabolomics 4, 261-272

Armitage, E. G., C. Barbas (2014). Metabolomics in cancer biomarker discovery: current trends

and future perspectives. Journal of pharmaceutical and biomedical analysis 87, 1-11

Assfalg, M., et al. (2008). Evidence of different metabolic phenotypes in humans. Proceedings of

the National Academy of Sciences 105, 1420-1424

Becker, S. A., A. M. Feist, M. L. Mo, G. Hannum, B. Ø. Palsson, M. J. Herrgard (2007).

Quantitative prediction of cellular metabolism with constraint-based models: the COBRA

Toolbox. Nature protocols 2, 727-738

Beckonert, O., J. Monnerjahn, U. Bonk, D. Leibfritz (2003). Visualizing metabolic changes in

breast‐cancer tissue using 1H‐NMR spectroscopy and self‐organizing maps. NMR in

Biomedicine 16, 1-11

Berk, M., T. Ebbels, G. Montana (2011). A statistical framework for biomarker discovery in

metabolomic time course data. Bioinformatics 27, 1979-85

doi:10.1093/bioinformatics/btr289 btr289 [pii]

Bezdek, J. C., C. Coray, R. Gunderson, J. Watson (1981). Detection and characterization of

cluster substructure i. linear structure: Fuzzy c-lines. SIAM Journal on Applied

Mathematics 40, 339-357

Bishop, C. M. (2006). Pattern recognition and machine learning. New York, Springer.

Blekherman, G., et al. (2011). Bioinformatics tools for cancer metabolomics. Metabolomics 7,

329-343 doi:10.1007/s11306-010-0270-3 270 [pii]

Boulesteix, A.-L. (2004). PLS dimension reduction for classification with microarray data.

Statistical applications in genetics and molecular biology 3,

Box, G. E., W. G. Hunter, J. S. Hunter (1978). Statistics for experimenters.

Brereton, R. G., G. R. Lloyd (2010). Support vector machines for classification and regression.

Analyst 135, 230-267

Broadhurst, D. I., D. B. Kell (2006). Statistical strategies for avoiding false discoveries in

metabolomics and related experiments. Metabolomics 2, 171-196

Brockwell, P. J., R. A. Davis (2002). Introduction to time series and forecasting, vol 1, Taylor &

Francis.

Bu, H.-L., G.-Z. Li, X.-Q. Zeng, J. Y. Yang, M. Q. Yang Feature selection and partial least

squares based dimension reduction for tumor classification. In: Bioinformatics and

Bioengineering, 2007 BIBE 2007 Proceedings of the 7th IEEE International Conference

on, 2007. IEEE, p 967-973

Bylesjö, M., M. Rantalainen, O. Cloarec, J. K. Nicholson, E. Holmes, J. Trygg (2006). OPLS

discriminant analysis: combining the strengths of PLS‐DA and SIMCA classification.

Journal of Chemometrics 20, 341-351

Cao, H., J. Dong, C. Cai, Z. Chen Investigations on the effects of NMR experimental conditions

in human urine and serum metabolic profiles. In: Bioinformatics and Biomedical


Engineering, 2008 ICBBE 2008 The 2nd International Conference on, 2008. IEEE, p

2236-2239

Chun, H., S. Keleş (2010). Sparse partial least squares regression for simultaneous dimension

reduction and variable selection. Journal of the Royal Statistical Society: Series B

(Statistical Methodology) 72, 3-25

Chung, D., S. Keles (2010). Sparse partial least squares classification for high dimensional data.

Stat Appl Genet Mol Biol 9, Article17 doi:10.2202/1544-6115.1492

Coombes, K. R., S. Tsavachidis, J. S. Morris, K. A. Baggerly, M. C. Hung, H. M. Kuerer (2005).

Improved peak detection and quantification of mass spectrometry data acquired from

surface-enhanced laser desorption and ionization by denoising spectra with the

undecimated discrete wavelet transform. Proteomics 5, 4107-4117

Cortes, C., V. Vapnik (1995). Support-vector networks. Machine learning 20, 273-297

Craig, A., O. Cloarec, E. Holmes, J. K. Nicholson, J. C. Lindon (2006). Scaling and

normalization effects in NMR spectroscopic metabonomic data sets. Analytical

Chemistry 78, 2262-2267

Cui, Q., et al. (2008). Metabolite identification via the Madison metabolomics consortium

database. Nature biotechnology 26, 162-164

Davis, R. A., A. J. Charlton, J. Godward, S. A. Jones, M. Harrison, J. C. Wilson (2007).

Adaptive binning: An improved binning method for metabolomics data using the

undecimated wavelet transform. Chemometrics and Intelligent Laboratory Systems 85,

144-154

De Soete, G., J. D. Carroll (1994). K-means clustering in a low-dimensional Euclidean space. In: New approaches in classification and data analysis (pp. 212-219). Springer.

Dettmer, K., P. A. Aronov, B. D. Hammock (2007). Mass spectrometry-based metabolomics.

Mass Spectrom Rev 26, 51-78 doi:10.1002/mas.20108

Dieterle, F., A. Ross, G. Schlotterbeck, H. Senn (2006). Probabilistic quotient normalization as

robust method to account for dilution of complex biological mixtures. Application in 1H

NMR metabonomics. Analytical Chemistry 78, 4281-4290

Draisma, H. H., T. H. Reijmers, J. J. Meulman, J. van der Greef, T. Hankemeier, D. I. Boomsma

(2013). Hierarchical clustering analysis of blood plasma lipidomics profiles from mono-

and dizygotic twin families. European Journal of Human Genetics 21, 95-101

Dunn, J. C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact

well-separated clusters.

Dunn, W. B., N. J. Bailey, H. E. Johnson (2005). Measuring the metabolome: current analytical

technologies. Analyst 130, 606-625

Dunn, W. B., et al. (2011). Procedures for large-scale metabolic profiling of serum and plasma

using gas chromatography and liquid chromatography coupled to mass spectrometry.

Nature protocols 6, 1060-1083

Dunn, W. B., I. D. Wilson, A. W. Nicholls, D. Broadhurst (2012). The importance of

experimental design and QC samples in large-scale and MS-driven untargeted

metabolomic studies of humans. Bioanalysis 4, 2249-2264

Eilers, P. H., B. D. Marx (1996). Flexible smoothing with B-splines and penalties. Statistical

science, 89-102

Emwas, A.-H., et al. (2014). Standardizing the experimental conditions for using urine in NMR-

based metabolomic studies with a particular focus on diagnostic studies: a review.

Metabolomics, 1-23


Enea, C., et al. (2010). 1H NMR-based metabolomics approach for exploring urinary

metabolome modifications after acute and chronic physical exercise. Analytical and

bioanalytical chemistry 396, 1167-1176

Ertöz, L., M. Steinbach, V. Kumar Finding clusters of different sizes, shapes, and densities in

noisy, high dimensional data. In: SDM, 2003. SIAM, p 47-58

Fahy, E., M. Sud, D. Cotter, S. Subramaniam (2007). LIPID MAPS online tools for lipid

research. Nucleic Acids Research 35, W606-W612

Förster, J., A. K. Gombert, J. Nielsen (2002). A functional genomics approach using

metabolomics and in silico pathway analysis. Biotechnology and Bioengineering 79, 703-

712

Gentleman, R. C., et al. (2004). Bioconductor: open software development for computational

biology and bioinformatics. Genome biology 5, R80

Gika, H. G., G. A. Theodoridis, R. S. Plumb, I. D. Wilson (2014). Current practice of liquid

chromatography–mass spectrometry in metabolomics and metabonomics. Journal of

pharmaceutical and biomedical analysis 87, 12-25

Griffin, J. L., H. Atherton, J. Shockcor, L. Atzori (2011). Metabolomics as a tool for cardiac

research. Nature 8, 630-643

Griffin, J. L., J. P. Shockcor (2004). Metabolic profiles of cancer cells. Nat Rev Cancer 4, 551-

61 doi:10.1038/nrc1390 nrc1390 [pii]

Griffiths, W. J., T. Koal, Y. Wang, M. Kohl, D. P. Enot, H. P. Deigner (2010). Targeted

metabolomics for biomarker discovery. Angew Chem Int Ed Engl 49, 5426-45

doi:10.1002/anie.200905579

Guan, W., et al. (2009). Ovarian cancer detection from metabolomic liquid

chromatography/mass spectrometry data by support vector machines. BMC

Bioinformatics 10, 259 doi:10.1186/1471-2105-10-259 1471-2105-10-259 [pii]

Gunderson, R. W. (1982). Choosing the r-dimension for the FCV family of clustering algorithms.

BIT Numerical Mathematics 22, 140-149

Gunderson, R. W. (1983). An adaptive FCV clustering algorithm. International Journal of Man-

Machine Studies 19, 97-104

Guyon, I., A. Elisseeff (2003). An introduction to variable and feature selection. The Journal of

Machine Learning Research 3, 1157-1182

Haddad, I., K. Hiller, E. Frimmersdorf, B. Benkert, D. Schomburg, D. Jahn (2009). An emergent

self-organizing map based analysis pipeline for comparative metabolome studies. In

Silico Biol 9, 163-78 doi:2009090014 [pii]

Hamerly, G., C. Elkan (2003). Learning the k in k-means. Advances in Neural Information

Processing Systems 16,

Hartigan, J. A., M. A. Wong (1979). Algorithm AS 136: A k-means clustering algorithm.

Journal of the Royal Statistical Society Series C (Applied Statistics) 28, 100-108

Hastie, T., R. Tibshirani, J. Friedman, T. Hastie, J. Friedman, R. Tibshirani (2009). The elements

of statistical learning, vol 2, Springer.

Heather, L. C., X. Wang, J. A. West, J. L. Griffin (2013). A practical guide to metabolomic

profiling as a discovery tool for human heart disease. Journal of molecular and cellular

cardiology 55, 2-11

Heinzmann, S. S., et al. (2010). Metabolic profiling strategy for discovery of nutritional

biomarkers: proline betaine as a marker of citrus consumption. The American journal of

clinical nutrition 92, 436-443


Henneges, C., et al. (2009). Prediction of breast cancer by profiling of urinary RNA metabolites

using Support Vector Machine-based feature selection. BMC cancer 9, 104

Holmans, P. (2010). Statistical methods for pathway analysis of genome-wide data for

association with complex genetic traits. Adv Genet 72, 141-79 doi:10.1016/B978-0-12-

380862-2.00007-2 B978-0-12-380862-2.00007-2 [pii]

Horai, H., et al. (2010). MassBank: a public repository for sharing mass spectral data for life

sciences. Journal of mass spectrometry 45, 703-714

Hou, Y., et al. (2012). Microbial strain prioritization using metabolomics tools for the discovery

of natural products. Anal Chem 84, 4277-83 doi:10.1021/ac202623g

http://pubs.acs.org/doi/pdf/10.1021/ac202623g

Huang, J. Z., M. K. Ng, H. Rong, Z. Li (2005). Automated variable weighting in k-means type

clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on 27, 657-

668

Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern recognition letters 31,

651-666

Jain, A. K., M. N. Murty, P. J. Flynn (1999). Data clustering: a review. ACM computing surveys

(CSUR) 31, 264-323

Jansen, J. J., H. C. Hoefsloot, H. F. Boelens, J. van der Greef, A. K. Smilde (2004). Analysis of

longitudinal metabolomics data. Bioinformatics 20, 2438-46

doi:10.1093/bioinformatics/bth268 bth268 [pii]

Johnson, R. A., D. W. Wichern (2007). Applied multivariate statistical analysis, 6th edn. Upper

Saddle River, N.J., Pearson Prentice Hall.

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 32, 241-254

Jolliffe, I. (2005). Principal component analysis, Wiley Online Library.

Kaddurah-Daouk, R., K. R. Krishnan (2009). Metabolomics: a global biochemical approach to

the study of central nervous system diseases. Neuropsychopharmacology 34, 173-86

doi:10.1038/npp.2008.174 npp2008174 [pii]

Kanehisa, M. (2002). The KEGG database. Novartis Found Symp 247, 91-101; discussion 101-3,

119-28, 244-52

Kang, S. M., et al. (2011). (1)H nuclear magnetic resonance based metabolic urinary profiling of

patients with ischemic heart failure. Clin Biochem 44, 293-9

doi:10.1016/j.clinbiochem.2010.11.010 S0009-9120(10)00511-4 [pii]

Kell, D. B., M. Brown, H. M. Davey, W. B. Dunn, I. Spasic, S. G. Oliver (2005). Metabolic

footprinting and systems biology: the medium is the message. Nature Reviews

Microbiology 3, 557-565 doi:10.1038/nrmicro1177

Khatri, P., M. Sirota, A. J. Butte (2012). Ten years of pathway analysis: current approaches and

outstanding challenges. PLoS Comput Biol 8, e1002375

doi:10.1371/journal.pcbi.1002375 PCOMPBIOL-D-11-00449 [pii]

Kilkenny, C., et al. (2009). Survey of the quality of experimental design, statistical analysis and

reporting of research using animals. PloS one 4, e7824

Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE 78, 1464-1480

Kohonen, T. (1998). The self-organizing map. Neurocomputing 21, 1-6

Kutner, M. H. (2005). Applied linear statistical models, 5th edn. Boston, McGraw-Hill Irwin.

Lauridsen, M., S. H. Hansen, J. W. Jaroszewski, C. Cornett (2007). Human urine as test material

in 1H NMR-based metabonomics: recommendations for sample preparation and storage.

Analytical Chemistry 79, 1181-1186


Li, F., J. Wang, L. Nie, W. Zhang (2012). Computational Methods to Interpret and Integrate Metabolomic Data, INTECH Open Access Publisher.

Li, H., Y. Liang, Q. Xu (2009). Support vector machines and its applications in chemistry. Chemometrics and Intelligent Laboratory Systems 95, 188-198

Li, X., X. Lu, J. Tian, P. Gao, H. Kong, G. Xu (2009). Application of fuzzy c-means clustering in data analysis of metabolomics. Analytical Chemistry 81, 4468-4475

Luo, W., C. Brouwer (2013). Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics 29, 1830-1831

Mahadevan, S., S. L. Shah, T. J. Marrie, C. M. Slupsky (2008). Analysis of metabolomic data using support vector machines. Analytical Chemistry 80, 7562-7570

Martens, H. (1992). Multivariate calibration, John Wiley & Sons.

Marzetti, E., et al. (2014). Patterns of circulating inflammatory biomarkers in older persons with varying levels of physical performance: a partial least squares-discriminant analysis approach. Frontiers in Medicine 1,

Matthiesen, R. (2010). Bioinformatics Methods in Clinical Research (Methods in Molecular Biology, Methods and Protocols). Springer.

Milliken, G. A., D. E. Johnson (2009). Analysis of messy data, 2nd edn. Boca Raton, CRC Press.

Milone, D. H., G. Stegmayer, M. López, L. Kamenetzky, F. Carrari (2014). Improving clustering with metabolic pathway data. BMC Bioinformatics 15, 101

Montgomery, D. C. (2008). Design and analysis of experiments, John Wiley & Sons.

Nguyen, D. V., D. M. Rocke (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18, 39-50

Nicholson, J. K., J. C. Lindon, E. Holmes (1999). 'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica 29, 1181-1189

Nin, N., J. Izquierdo-García, J. Lorente (2012). The metabolomic approach to the diagnosis of critical illness. In: Annual Update in Intensive Care and Emergency Medicine 2012 (pp. 43-52). Springer.

Nueda, M. J., et al. (2007). Discovering gene expression patterns in time course microarray experiments by ANOVA-SCA. Bioinformatics 23, 1792-800 doi:10.1093/bioinformatics/btm251

O'Sullivan, A., M. J. Gibney, L. Brennan (2011). Dietary intake patterns are reflected in metabolomic profiles: potential role in dietary assessment studies. The American Journal of Clinical Nutrition 93, 314-321

Oliver, S. G. (2002). Functional genomics: lessons from yeast. Philos Trans R Soc Lond B Biol Sci 357, 17-23 doi:10.1098/rstb.2001.1049

Oliver, S. G., M. K. Winson, D. B. Kell, F. Baganz (1998). Systematic functional analysis of the yeast genome. Trends Biotechnol 16, 373-8

Papin, J. A., J. Stelling, N. D. Price, S. Klamt, S. Schuster, B. O. Palsson (2004). Comparison of network-based pathway analysis methods. Trends Biotechnol 22, 400-5 doi:10.1016/j.tibtech.2004.06.010

Patel, K. N., J. K. Patel, M. P. Patel, G. C. Rajput, H. A. Patel (2010). Introduction to hyphenated techniques and their applications in pharmacy. Pharmaceutical Methods 1, 2-13

Pauling, L., A. B. Robinson, R. Teranishi, P. Cary (1971). Quantitative analysis of urine vapor and breath by gas-liquid partition chromatography. Proc Natl Acad Sci U S A 68, 2374-6

Poroyko, V., et al. (2011). Diet creates metabolic niches in the "immature gut" that shape microbial communities. Nutr Hosp 26, 1283-95 doi:10.1590/S0212-16112011000600015

Putri, S. P., et al. (2013). Current metabolomics: practical applications. J Biosci Bioeng 115, 579-89 doi:10.1016/j.jbiosc.2012.12.007

Ramadan, Z., D. Jacobs, M. Grigorov, S. Kochhar (2006). Metabolic profiling using principal component analysis, discriminant partial least squares, and genetic algorithms. Talanta 68, 1683-1691

Raman, K., N. Chandra (2009). Flux balance analysis of biological systems: applications and challenges. Brief Bioinform 10, 435-49 doi:10.1093/bib/bbp011

Riter, L. S., O. Vitek, K. M. Gooding, B. D. Hodge, R. K. Julian (2005). Statistical design of experiments as a tool in mass spectrometry. Journal of Mass Spectrometry 40, 565-579

Rocke, D. M. (2004). Design and analysis of experiments with high throughput biological assay data. Seminars in Cell & Developmental Biology 15, 703-713

Savorani, F., G. Tomasi, S. B. Engelsen (2010). icoshift: A versatile tool for the rapid alignment of 1D NMR spectra. Journal of Magnetic Resonance 202, 190-202

Scalbert, A., et al. (2009). Mass-spectrometry-based metabolomics: limitations and recommendations for future progress with particular focus on nutrition research. Metabolomics 5, 435-458

Schilling, C. H., S. Schuster, B. O. Palsson, R. Heinrich (1999). Metabolic pathway analysis: basic concepts and scientific applications in the post-genomic era. Biotechnology Progress 15, 296-303

Schölkopf, B., A. Smola, K.-R. Müller (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299-1319

Slupsky, C. M., et al. (2007). Investigations of the effects of gender, diurnal variation, and age in human urinary metabolomic profiles. Analytical Chemistry 79, 6995-7004

Smilde, A. K., J. J. Jansen, H. C. Hoefsloot, R. J. Lamers, J. van der Greef, M. E. Timmerman (2005). ANOVA-simultaneous component analysis (ASCA): a new tool for analyzing designed metabolomics data. Bioinformatics 21, 3043-8 doi:10.1093/bioinformatics/bti476

Smilde, A. K., et al. (2010). Dynamic metabolomic data analysis: a tutorial review. Metabolomics 6, 3-17 doi:10.1007/s11306-009-0191-1

Smith, C. A., et al. (2005). METLIN: a metabolite mass spectral database. Therapeutic Drug Monitoring 27, 747-751

Smolinska, A., L. Blanchet, L. M. Buydens, S. S. Wijmenga (2012). NMR and pattern recognition methods in metabolomics: from data acquisition to biomarker discovery: a review. Anal Chim Acta 750, 82-97 doi:10.1016/j.aca.2012.05.049

Steinley, D., M. J. Brusco (2008). Selection of variables in cluster analysis: An empirical comparison of eight procedures. Psychometrika 73, 125-144

Steuer, R. (2007). Computational approaches to the topology, stability and dynamics of metabolic networks. Phytochemistry 68, 2139-51 doi:10.1016/j.phytochem.2007.04.041

Stretch, C., et al. (2012). Prediction of skeletal muscle and fat mass in patients with advanced cancer using a metabolomic approach. The Journal of Nutrition 142, 14-21

Szczesniak, R. D., G. L. McPhail, L. L. Duan, M. Macaluso, R. S. Amin, J. P. Clancy (2013). A semiparametric approach to estimate rapid lung function decline in cystic fibrosis. Annals of Epidemiology 23, 771-777

Szymanska, E., E. Saccenti, A. K. Smilde, J. A. Westerhuis (2012). Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies. Metabolomics 8, 3-16 doi:10.1007/s11306-011-0330-3

Theodoridis, G. A., H. G. Gika, E. J. Want, I. D. Wilson (2012). Liquid chromatography–mass spectrometry based global metabolite profiling: a review. Analytica Chimica Acta 711, 7-16

Tibshirani, R., G. Walther, T. Hastie (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 411-423

Timmerman, M. E., E. Ceulemans, K. De Roover, K. Van Leeuwen (2013). Subspace K-means clustering. Behavior Research Methods 45, 1011-1023

Timmerman, M. E., E. Ceulemans, H. A. Kiers, M. Vichi (2010). Factorial and reduced K-means reconsidered. Computational Statistics & Data Analysis 54, 1858-1871

Timmerman, M. E., H. C. Hoefsloot, A. K. Smilde, E. Ceulemans (2015). Scaling in ANOVA-simultaneous component analysis. Metabolomics, 1-12

Tomar, N., R. K. De (2013). Comparing methods for metabolic network analysis and an application to metabolic engineering. Gene,

Tomasi, G., F. van den Berg, C. Andersson (2004). Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data. Journal of Chemometrics 18, 231-241

Trygg, J., S. Wold (2002). Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics 16, 119-128

Ultsch, A. (2003). U*-matrix: a tool to visualize clusters in high dimensional data. Fachbereich Mathematik und Informatik.

van den Berg, R. A., H. C. Hoefsloot, J. A. Westerhuis, A. K. Smilde, M. J. van der Werf (2006). Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142

VanDyke, R., Y. Ren, H. J. Sucharew, M. Miodovnik, B. Rosenn, J. C. Khoury (2012). Characterizing maternal glycemic control: a more informative approach using semiparametric regression. Journal of Maternal-Fetal and Neonatal Medicine 25, 15-19

Velagapudi, V. R., et al. (2010). The gut microbiota modulates host energy and lipid metabolism in mice. Journal of Lipid Research 51, 1101-1112

Vettukattil, R. (2015). Preprocessing of raw metabonomic data. Metabonomics: Methods and Protocols, 123-136

Vichi, M., H. A. Kiers (2001). Factorial k-means analysis for two-way data. Computational Statistics & Data Analysis 37, 49-64

Wang-Sattler, R., et al. (2012). Novel biomarkers for pre-diabetes identified by metabolomics. Molecular Systems Biology 8, doi:10.1038/msb.2012.43

Wetmore, D. R., et al. (2010). Metabolomic profiling reveals biochemical pathways and biomarkers associated with pathogenesis in cystic fibrosis cells. Journal of Biological Chemistry 285, 30516-30522 doi:10.1074/jbc.M110.140806

Wiechert, W. (2002). Modeling and simulation: tools for metabolic engineering. J Biotechnol 94, 37-63

Wishart, D. S. (2007). Current progress in computational metabolomics. Briefings in Bioinformatics 8, 279-293

Wishart, D. S., et al. (2013). HMDB 3.0--The Human Metabolome Database in 2013. Nucleic Acids Res 41, D801-7 doi:10.1093/nar/gks1065

Wold, H. (1966). Estimation of principal components and related models by iterative least squares. Multivariate Analysis 1, 391-420

Wold, S., A. Ruhe, H. Wold, W. J. Dunn III (1984). The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing 5, 735-743

Wold, S., M. Sjöström, L. Eriksson (2001). PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58, 109-130

Xi, Y., D. M. Rocke (2008). Baseline correction for NMR spectroscopic metabolomics data analysis. BMC Bioinformatics 9, 324

Xia, J., D. I. Broadhurst, M. Wilson, D. S. Wishart (2012). Translational biomarker discovery in clinical metabolomics: an introductory tutorial. Metabolomics 9, 280-299 doi:10.1007/s11306-012-0482-9

Xia, J., R. Mandal, I. V. Sinelnikov, D. Broadhurst, D. S. Wishart (2012). MetaboAnalyst 2.0—a comprehensive server for metabolomic data analysis. Nucleic Acids Research 40, W127-W133

Xia, J., N. Psychogios, N. Young, D. S. Wishart (2009). MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Research 37, W652-W660

Xing, E. P., M. I. Jordan, S. Russell, A. Y. Ng (2002). Distance metric learning with application to clustering with side-information. In: Advances in Neural Information Processing Systems, pp. 505-512

Yan, M., K. Ye (2007). Determining the number of clusters using the weighted gap statistic. Biometrics 63, 1031-1037

Yang, C., Z. He, W. Yu (2009). Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinformatics 10, 4

Zhang, J. D., S. Wiemann (2009). KEGGgraph: a graph approach to KEGG PATHWAY in R and Bioconductor. Bioinformatics 25, 1470-1471

Zhang, S., G. N. Gowda, V. Asiago, N. Shanaiah, C. Barbas, D. Raftery (2008). Correlative and quantitative 1H NMR-based metabolomics reveals specific metabolic pathway disturbances in diabetic rats. Analytical Biochemistry 383, 76-84

Table 1. ORA analysis table

                                 Metabolites on list   Metabolites not on list   Subtotal
Metabolites in the pathway       a                     b                         a+b
Metabolites not in the pathway   c                     d                         c+d
Subtotal                         a+c                   b+d
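The 2x2 layout of Table 1 maps directly onto a hypergeometric test: ORA asks how surprising it is that a of the a+c significant metabolites fall in a pathway containing a+b of the measured metabolites. A minimal sketch in Python with invented counts (in practice this is done with the tools in Table 2, such as MetaboAnalyst or Bioconductor):

```python
from math import comb

def ora_pvalue(a, b, c, d):
    """Upper-tail hypergeometric p-value for the 2x2 ORA table of
    Table 1: the probability of seeing at least `a` pathway
    metabolites on the significant list by chance alone."""
    N = a + b + c + d      # all measured metabolites
    K = a + b              # metabolites in the pathway
    n = a + c              # metabolites on the significant list
    return sum(
        comb(K, k) * comb(N - K, n - k)
        for k in range(a, min(K, n) + 1)
    ) / comb(N, n)

# Hypothetical counts: 8 of 50 significant metabolites fall in a
# 20-metabolite pathway, out of 1000 measured metabolites.
p = ora_pvalue(a=8, b=12, c=42, d=938)
```

The expected overlap here is only 50 * 20 / 1000 = 1 metabolite, so an overlap of 8 yields a very small p-value and the pathway would be flagged as over-represented.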

Table 2. Summary of commonly used metabolomics data analysis methods

Basic statistical testing
  Goal: biomarker discovery; feature selection for supervised learning.
  Input: data tables with class memberships, where each row represents one subject and each column the concentration of one metabolite, or MS and NMR peak lists or spectral bins.
  Output: lists of selected metabolites with p-values; volcano plot (for two groups).
  Software (a): MetaboAnalyst (b), R and MATLAB.
  Methods:
  - Fold change: fold change of metabolite concentrations between two groups.
  - Two-sample tests (e.g. t-test): identify significantly differentially expressed metabolites between two groups.
  - ANOVA with post-hoc analysis: identify significantly differentially expressed metabolites across multiple groups.

Unsupervised learning
  Goal: data grouping and visualization.
  Input: same as above, but class memberships are not needed.
  Software: MetaboAnalyst, R and MATLAB.
  Methods:
  - Principal component analysis (PCA): reduce dimensionality and check clusters visually. Output: scores and loading plots for visualization.
  - Clustering (K-means, fuzzy c-means, K-means in reduced space): group the subjects into k different clusters. Output: cluster labels for each subject.
  - Hierarchical clustering: show how subjects form different clusters. Output: heatmap, dendrogram.
  - Self-organizing map (SOM): reduce dimensionality and check clusters visually. Output: SOM.

Supervised learning - classification
  Methods: support vector machine (SVM); partial least squares discriminant analysis (PLS-DA), OPLS-DA and SPLS-DA.
  Goal: predict class membership (disease diagnosis); biomarker discovery.
  Application: all are classification methods suitable for metabolomics research; their performance needs to be compared in real data analysis.
  Input: same as above, but class memberships are needed.
  Output: a prediction model with selected metabolites as predictors.
  Software: MetaboAnalyst, R and MATLAB.

Supervised learning - regression
  Methods: support vector regression (SVR); partial least squares (PLS), OPLS and SPLS.
  Goal: predict continuous variables; calibration; biomarker discovery.
  Application: all are regression methods suitable for metabolomics research; their performance needs to be compared in real data analysis.
  Input: the response variables should be a continuous vector or matrix; for calibration problems, the responses are concentrations and the covariates are spectral intensities.
  Output: a prediction model with selected variables as predictors.

Pathway analysis
  Goal: find biologically meaningful metabolite sets or pathways that are associated with certain diseases or characteristics.
  Methods:
  - Over-representation analysis (ORA): find related pathways or metabolite sets using statistical testing. Input: significant metabolite list and reference pathways. Output: a list of selected pathways with their p-values or FDRs. Software: MetaboAnalyst, Bioconductor (c).
  - Functional class scoring (FCS) / enrichment analysis: same goal and output as ORA, but the input is a metabolite list with concentrations plus reference pathways or metabolite sets. Software: MetaboAnalyst, Bioconductor.
  - Elementary modes / extreme pathways: simulation and pathway reconstruction; analyze or reconstruct metabolic pathways. Input: pathway and stoichiometric matrix. Output: several different possible flux distributions. Software: MATLAB.
  - Flux balance analysis: find the flux distribution by maximizing or minimizing a reaction subject to constraints. Input: pathway, stoichiometric matrix, constraints and an objective function. Output: one unique flux distribution. Software: MATLAB.
  - Kinetic reaction network model: simulate the dynamic behavior of metabolite reaction networks. Input: pathway, stoichiometric matrix, kinetic parameters and rate equations. Output: simulation results of the network kinetics. Software: MATLAB.

Time course data analysis
  Goal: test time-dependent effects; compare time profiles for metabolites of different subjects.
  Input: data tables with labels (class membership indicators), where each row represents the observation of one subject at one time point and each column the metabolite concentrations, or MS and NMR peak lists or spectral bins.
  Software: MetaboAnalyst, R.
  Methods:
  - Fixed effects ANOVA: two or more given factors. Output: test results of time-dependent effects for each metabolite; time profile plots.
  - Repeated measures ANOVA: ANOVA for multiple within-subject measurements taken at different time points.
  - ANOVA-simultaneous component analysis (ASCA): ANOVA for a multivariate response. Output: scores and loading plots of the submodels; time profile plots.
  - Functional methods (e.g. smoothing splines mixed effects model): estimate sets of curves (metabolic profiles) and quantify their differences. Output: test results of the differences in time profiles between groups for each metabolite.
  - Time series models (e.g. ARMA model): model the dynamic properties (e.g. biorhythms) of metabolite data; suited to long time series. Output: a model that best describes the dynamics of the underlying biological process.

a. We only picked some commonly used software packages and web tools in metabolomics. b. MetaboAnalyst is an easy-to-use web-based tool that covers most basic computational and statistical methods for metabolomics (Xia, Mandal, Sinelnikov, Broadhurst, Wishart 2012; Xia, Psychogios, Young, Wishart 2009). c. Bioconductor is built on R and provides many useful packages for bioinformatics (Gentleman, Carey, Bates, Bolstad, Dettling, Dudoit, Ellis et al. 2004), including packages for metabolic pathway analysis (Luo, Brouwer 2013; Zhang, Wiemann 2009).
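Several of the methods summarized in Table 2 are conceptually simple; as one illustration, the K-means entry under unsupervised learning can be sketched in a few lines. The sketch below is in Python with synthetic two-metabolite profiles (the table's listed software is MetaboAnalyst, R and MATLAB; Python is used here only for illustration, and the deterministic initialization is an illustrative choice rather than a standard one):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, k, iters=50):
    """Minimal Lloyd's K-means: repeatedly assign each subject (a
    tuple of metabolite concentrations) to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    # Deterministic "spread" initialization over the input order.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    # Final cluster label for each subject.
    return [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]

# Six synthetic two-metabolite profiles forming two well-separated groups.
profiles = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = kmeans(profiles, k=2)  # -> [0, 0, 0, 1, 1, 1]
```

As Table 2 notes, the output is simply a cluster label per subject; choosing k itself requires additional criteria such as the gap statistic (Tibshirani, Walther, Hastie 2001).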