All content in this area was uploaded by Michele Forina on Nov 19, 2017


CHAPTER 2

Data Analysis and Chemometrics

Paolo Oliveri, Michele Forina

Department of Drug and Food Chemistry and Technology, University of Genoa,

Via Brigata Salerno, Genoa, Italy

OUTLINE

2.1. Introduction
2.1.1. From Data to Information
2.2. From Univariate to Multivariate
2.2.1. Histograms
2.2.2. Normality Tests
2.2.3. ANOVA
2.2.4. Radar Charts
2.3. Multivariate Data Analysis
2.3.1. Principal-Component Analysis
2.3.2. Signal Pre-Processing
2.3.3. Supervised Data Analysis and Validation
2.3.4. Supervised Qualitative Modeling
2.3.5. Supervised Quantitative Modeling
2.3.6. Artificial Neural Networks

2.1. INTRODUCTION

2.1.1. From Data to Information

Advances in technology and the increasing availability of powerful instrumentation now allow analytical food chemists to obtain large amounts of data on each sample analyzed, in a reasonable (often negligible) time frame (Valcárcel and Cárdenas, 2005).

Often, in fact, a single analysis provides a considerable number of measured quantities, generally of the same nature. For instance, gas chromatographic (GC) analysis of fatty acid methyl esters allows us to quantify, with a single chromatogram, the fatty acid composition of a vegetable oil sample (American Oil Chemists' Society, 1998). Spectroscopic techniques likewise supply, with a single rapid analysis of a sample, multiple data of homogeneous nature: a spectrum can be considered as a data vector in which the order of the variables (e.g., absorbances at consecutive wavelengths) has a physical meaning (Oliveri et al., 2011).

Chemical Analysis of Food: Techniques and Applications

DOI: 10.1016/B978-0-12-384862-8.00002-9 Copyright © 2012 Elsevier Inc. All rights reserved.


10002-PICO-9780123848628

To protect the rights of the author(s) and publisher we inform you that this PDF is an uncorrected proof for internal business use only by the author(s), editor(s), reviewer(s),

Elsevier and typesetter TNQ Books and Journals Pvt Ltd. It is not allowed to publish this proof online or in print. This proof copy is the copyright property of the publisher

and is confidential until formal publication.

In other cases, a set of samples is described by a number of heterogeneous chemical and physical parameters at the same time. For example, a global analytical characterization of a tomato sauce may involve the quantification of color and rheological parameters as well as pH, chemical composition and, possibly, a number of sensory responses (Sharoba et al., 2005). In such cases too, each sample may be described by a data vector, but the order of the variables carries no meaning. Instead, differences in magnitude and scale between variables may affect data analysis if a proper pre-processing approach is not followed.

The availability of large data sets does not, by itself, mean that information about the samples analyzed is immediately available: usually, a number of steps are required to extract and properly interpret the potential information embodied within the data (Martens and Kohler, 2008).

A deep understanding of the nature of analytical data is the first basic step for any proper data treatment, because different data types usually require different processing strategies, which closely depend on their nature and origin. For this reason, the data analyst should always be fully aware of the problem under study and of the whole analytical process from which the data derive, from sampling to instrumental analysis. Such knowledge is fundamental: it makes the difference between a chemometrician and a mathematician. A chemometrician is, first of all, a chemist who is acquainted with the data and uses mathematical methods to convert numerical records into relevant chemical information.

The analytical food chemist William Sealy Gosset (1876-1937), who worked at the Arthur Guinness & Son brewery in Dublin, can be considered one of the fathers of chemometrics. He studied a number of statistical tools and adapted them to better solve real chemical problems. He had to present his studies under a pseudonym, since his company did not permit him to publish any data. Considering himself a modest contributor to the field rather than a statistician, he adopted the pen name Student. His most famous work concerned the definition of the probability distribution that is commonly referred to as Student's t distribution (Student, 1908).

The term chemometrics was first used by Svante Wold, in 1972, to identify the discipline that performs the extraction of useful chemical information from complex experimental systems (Wold, 1972).

Statistics offers a number of helpful tools that can be used for converting data into information. Univariate methods, which consider one variable at a time, independently of the others, have been and are still extensively used for such purposes. Nonetheless, they usually supply just partial answers to the problems under study, since they underutilize the potential for discovering global information embodied in the data. For instance, they are not able to take into account inter-correlation between variables, a feature that can be very informative if recognized and properly interpreted.

Multivariate strategies are able to take this aspect into account, allowing a more complete interpretation of data structures. However, in spite of their great potential, multivariate methods are generally less used than univariate tools.

On the other hand, a number of people try multivariate analysis as a last-ditch resort, when nothing else seems to provide the desired results, expecting chemometrics to provide valuable information from data that do not contain any informative feature at all.

Such an attitude is very hazardous, especially when complex methods are being used, because there is a risk of exploiting chance correlations to develop models that perform well only in appearance (namely, on the same samples used for model building) but have very poor prediction ability on new samples: this is so-called overfitting. To guard against this possibility, proper validation of models is always required. In particular, the more complex the technique applied, the deeper the validation recommended.
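The risk of mistaking chance correlations for a good model can be illustrated with a small sketch. It is not taken from the chapter: the data are pure random noise, and the nearest-neighbour rule (with the hypothetical helper names `make_noise`, `nearest_label`, and `accuracy`) is just one convenient example of a flexible model that can memorize the samples used to build it.

```python
import random


def make_noise(n, rng):
    """n samples whose measurement x carries no information on the label."""
    return [(rng.random(), rng.choice("AB")) for _ in range(n)]


def nearest_label(x, model_set):
    """Label of the model-building sample closest to x (1-nearest neighbour)."""
    return min(model_set, key=lambda s: abs(s[0] - x))[1]


def accuracy(samples, model_set):
    """Fraction of samples whose label is predicted correctly."""
    hits = sum(nearest_label(x, model_set) == y for x, y in samples)
    return hits / len(samples)


rng = random.Random(0)           # fixed seed, so the run is repeatable
building = make_noise(200, rng)  # samples used to build the model
new = make_noise(200, rng)       # samples never seen by the model

fit_acc = accuracy(building, building)  # each sample is its own neighbour
new_acc = accuracy(new, building)       # expected to hover around 0.5
print(f"same samples: {fit_acc:.2f}   new samples: {new_acc:.2f}")
```

Evaluated on the model-building samples the rule looks perfect, although the data contain no information at all; only the evaluation on new samples reveals this, which is why validation on samples not used for model building is needed.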

For these reasons, a good understanding of the characteristics of the methods employed for data processing is always advantageous as well.

In this chapter, an overview of the chemometric techniques most commonly used for data analysis in analytical food chemistry is presented, highlighting the potential and limits of each one.

2.2. FROM UNIVARIATE TO MULTIVARIATE

A bidimensional table is probably the most typical way to arrange, present, and store analytical data: conventionally, in chemometrics, each row represents one of the samples analyzed, while each column corresponds to one of the variables measured.
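As a minimal sketch of this row-by-column convention, the snippet below stores a five-sample, three-variable excerpt of Table 2.1 and computes column-wise statistics. The names `variables`, `samples`, and `columns` are illustrative, and the use of the n - 1 estimators for variance and standard deviation is an assumption; because only an excerpt is used, the printed values differ from the statistics reported at the bottom of the full table.

```python
import statistics

# Excerpt of Table 2.1: five samples (rows), three variables (columns).
variables = ["Alcohol (% abv)", "Sugar-free extract (g/l)", "Fixed acidity (g/l)"]
samples = [
    # (name, category, measured values)
    ("OLO0171", "Barolo",     [14.23, 24.82, 73.10]),
    ("OLO0271", "Barolo",     [13.20, 26.30, 72.80]),
    ("OLO0371", "Barolo",     [13.16, 26.30, 68.50]),
    ("GRI0170", "Grignolino", [12.37, 18.30, 90.10]),
    ("ERA0174", "Barbera",    [12.86, 26.80, 87.30]),
]

# Transpose the rows to obtain one tuple of values per variable.
columns = list(zip(*(values for _, _, values in samples)))

for name, column in zip(variables, columns):
    # Assumption: sample (n - 1) estimators for variance and standard deviation.
    print(f"{name}: mean={statistics.mean(column):.2f} "
          f"variance={statistics.variance(column):.2f} "
          f"sd={statistics.stdev(column):.2f}")
```

Keeping the sample names and categories next to the numerical values, but outside the numerical columns, mirrors the role of the heading rows and columns of the table.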

As an example, Table 2.1 reports the red wines data set, which consists of 27 chemical and physical parameters measured on 90 wine samples belonging to three Italian denominations of origin from the same region (Piedmont): Barolo, Grignolino, and Barbera. The original data set was composed of 178 samples (Forina et al., 1986).

Table 2.1 also contains additional information, which is usually not processed but which may be extremely helpful in the final understanding and interpretation of the results. In particular, the two heading lines contain the numbers and the names of the variables, which are additional information for the columns, while the two heading columns include the names identifying the samples and their class, which are additional information for the rows.

It is easy to guess that such data enclose a great deal of potential information. However, simple visual inspection of the table, which contains a considerable number of records, does not directly provide any valuable information about the samples analyzed. A conversion from data into information is necessary.

Univariate methods are still the most used in many cases, although they generally offer only a very limited view of the global situation.

2.2.1. Histograms

A good way to extract information from data is to use graphical tools. Among them, histograms are probably the most widely employed (Chambers et al., 1983).

To build a histogram, the range of interest of the variable under study is divided into a number of regular adjacent intervals. For each interval, the contribution of the measured samples is graphically displayed by a vertical rectangle, whose area is proportional to the frequency (i.e., the number of observations) within that interval. Consequently, the height of each rectangle is equal to the frequency divided by the interval width, so that it has the dimension of a frequency density.

Frequently, such frequency values are normalized by dividing each of them by the total number of observations, thus obtaining relative values. It follows that, in such cases, the sum of the areas of all the rectangles (i.e., the sum of all the relative frequencies) is equal to 1.
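The construction can be sketched in a few lines. The function name `histogram`, the chosen range (300 to 600), and the number of intervals (3) are illustrative assumptions; the ten phosphate values are those of the first ten Barolo samples in Table 2.1.

```python
def histogram(values, low, high, n_bins):
    """Counts, relative frequencies and frequency densities on regular bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # The upper edge of the range is assigned to the last interval.
            k = min(int((v - low) / width), n_bins - 1)
            counts[k] += 1
    total = sum(counts)
    relative = [c / total for c in counts]
    density = [r / width for r in relative]  # rectangle heights
    return counts, relative, density, width


# Phosphate values of the first ten Barolo samples in Table 2.1.
phosphate = [320, 395, 497, 580, 408, 418, 306, 502, 440, 391]
counts, relative, density, width = histogram(phosphate, 300, 600, 3)

area = sum(h * width for h in density)  # sum of the rectangle areas
print(counts, relative, area)
```

Because each height is a relative frequency divided by the interval width, multiplying back by the width and summing recovers the total relative frequency, so the areas add up to 1 whatever interval width is chosen.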

The frequency distribution visualized by a histogram can be used to estimate the probability distribution of the variable under study and to make deductions about the samples.


TABLE 2.1 Red-Wine Data Set, and Basic Statistical Parameters

Name; Category; (1) Alcohol (% abv); (2) Sugar-free extract (g/l); (3) Fixed acidity (g/l); (4) Tartaric acid (g/l); (5) Malic acid (g/l); (6) Uronic acids (mg/l); (7) pH; (8) Ash (g/l); (9) Alcalinity of ash (meq/l); (10) Potassium (mg/l); (11) Calcium (mg/l); (12) Magnesium (mg/l); (13) Phosphate (g/l)

OLO0171 Barolo 14.23 24.82 73.10 1.21 1.71 0.72 3.38 2.43 15.60 950 62 127 320

OLO0271 Barolo 13.20 26.30 72.80 1.84 1.78 0.71 3.30 2.14 11.20 765 75 100 395

OLO0371 Barolo 13.16 26.30 68.50 1.94 2.36 0.84 3.48 2.67 18.60 936 70 101 497

OLO0471 Barolo 14.37 25.85 74.90 1.59 1.95 0.72 3.43 2.50 16.80 985 47 113 580

OLO0571 Barolo 13.24 26.05 83.50 1.30 2.59 1.10 3.42 2.87 21.00 1088 70 118 408

OLO0671 Barolo 14.20 28.40 79.90 2.14 1.76 0.96 3.39 2.45 15.20 868 71 112 418

OLO0771 Barolo 14.39 27.02 64.30 1.64 1.87 0.95 3.42 2.45 14.60 889 67 96 306

OLO0871 Barolo 14.06 26.40 73.50 1.33 2.15 1.14 3.54 2.61 17.60 894 50 121 502

OLO0971 Barolo 14.83 26.80 69.50 1.82 1.64 0.67 3.30 2.17 14.00 765 49 97 440

OLO1071 Barolo 13.86 27.00 68.50 1.92 1.35 0.67 3.27 2.27 16.00 794 51 98 391

OLO1171 Barolo 14.10 26.08 72.50 1.64 2.16 0.62 3.31 2.30 18.00 838 61 105 399

OLO1271 Barolo 14.12 28.35 72.90 1.51 1.48 0.96 3.20 2.32 16.80 827 60 95 424

OLO1371 Barolo 13.75 30.25 75.10 1.92 1.73 0.64 3.18 2.41 16.00 752 65 89 453

OLO1471 Barolo 14.75 30.40 98.90 2.08 1.73 0.72 3.01 2.39 11.40 910 46 91 510

OLO1571 Barolo 14.38 27.10 72.30 1.95 1.87 0.67 3.20 2.38 12.00 927 29 102 523

OLO1671 Barolo 13.63 27.15 69.60 1.48 1.81 0.67 3.47 2.70 17.20 905 28 112 385

OLO1771 Barolo 14.30 27.90 74.90 1.41 1.92 0.82 3.40 2.72 20.00 860 108 120 513

OLO1871 Barolo 13.83 26.30 64.90 1.93 1.57 0.68 3.43 2.62 20.00 905 68 115 419

OLO1971 Barolo 14.19 26.40 72.00 1.85 1.59 0.82 3.38 2.48 16.50 964 86 108 488

OLO0173 Barolo 13.64 27.72 91.50 1.35 3.10 0.82 3.30 2.56 15.20 1038 111 116 402

OLO0273 Barolo 14.06 25.32 71.10 1.34 1.63 1.00 3.47 2.28 16.00 905 79 126 323

OLO0373 Barolo 12.93 28.80 102.10 1.05 3.80 0.89 3.26 2.65 18.60 915 79 102 294

OLO0473 Barolo 13.71 27.63 80.00 2.23 1.86 1.21 3.33 2.36 16.60 815 89 101 476

OLO0573 Barolo 12.85 25.80 69.60 1.54 1.60 0.79 3.45 2.52 17.80 958 101 95 415

OLO0673 Barolo 13.50 25.00 81.60 1.55 1.81 0.95 3.42 2.61 20.00 992 62 96 476

OLO0773 Barolo 13.05 25.72 78.30 1.15 2.05 1.08 3.57 3.22 25.00 1095 63 124 536

OLO0873 Barolo 13.39 27.10 72.30 1.52 1.77 1.05 3.46 2.62 16.10 936 68 93 395


OLO0973 Barolo 13.30 22.70 68.30 1.74 1.72 1.06 3.44 2.14 17.00 882 52 94 434

OLO1073 Barolo 13.87 29.30 68.30 1.38 1.90 0.75 3.42 2.80 19.40 1085 68 107 396

OLO1173 Barolo 14.02 25.20 69.60 1.71 1.68 0.79 3.26 2.21 16.00 780 62 96 510

GRI0170 Grignolino 12.37 18.30 90.10 2.80 0.94 0.73 3.11 1.36 10.60 580 77 88 296

GRI0270 Grignolino 12.33 22.90 72.20 2.25 1.10 0.69 3.26 2.28 16.00 715 85 101 365

GRI0370 Grignolino 12.64 23.90 95.70 1.93 1.36 1.06 3.19 2.02 16.80 688 83 100 395

GRI0470 Grignolino 13.67 22.20 64.80 2.20 1.25 0.74 3.40 1.92 18.00 725 51 94 301

GRI0570 Grignolino 12.37 23.50 70.00 2.06 1.13 0.72 3.30 2.16 19.00 785 73 87 422

GRI0670 Grignolino 12.17 23.03 65.70 1.84 1.45 0.72 3.35 2.53 19.00 790 62 104 411

GRI0770 Grignolino 12.37 26.80 62.70 1.70 1.21 0.88 3.40 2.56 18.10 978 55 98 310

GRI0870 Grignolino 13.11 23.70 80.00 1.40 1.01 0.77 3.10 1.70 15.00 730 80 78 297

GRI0970 Grignolino 12.37 20.90 63.70 1.94 1.17 0.67 3.40 1.92 19.60 785 40 78 212

GRI0171 Grignolino 13.34 23.72 70.00 2.02 0.94 1.09 3.26 2.36 17.00 760 64 110 451

GRI0271 Grignolino 12.21 22.70 90.70 3.62 1.19 0.94 3.14 1.75 16.80 795 134 151 448

GRI0371 Grignolino 12.29 21.40 55.60 1.43 1.61 0.87 3.54 2.21 20.40 682 102 103 324

GRI0471 Grignolino 13.86 25.25 59.50 1.27 1.51 1.09 3.63 2.67 25.00 785 63 86 383

GRI0571 Grignolino 13.49 22.30 60.90 1.74 1.66 0.67 3.44 2.24 24.00 680 60 87 300

GRI0671 Grignolino 12.99 26.10 50.50 1.42 1.67 1.24 3.52 2.60 30.00 974 55 139 473

GRI0771 Grignolino 11.96 24.50 65.70 2.18 1.09 0.73 3.40 2.30 21.00 681 98 101 366

GRI0871 Grignolino 11.66 20.30 61.70 1.70 1.88 0.60 3.30 1.92 16.00 785 52 97 312

GRI0971 Grignolino 13.03 23.50 78.60 1.90 0.90 0.76 3.30 1.71 16.00 790 57 86 396

GRI0172 Grignolino 11.84 26.40 108.70 1.70 2.89 0.91 3.11 2.23 18.00 790 71 112 350

GRI0272 Grignolino 12.33 20.60 58.70 2.41 0.99 0.84 3.32 1.95 14.80 680 124 136 438

GRI0372 Grignolino 12.70 27.15 93.30 1.46 3.87 1.11 3.19 2.40 23.00 890 110 101 321

GRI0472 Grignolino 12.00 23.20 58.40 1.88 0.92 0.82 3.30 2.00 19.00 680 63 86 408

GRI0572 Grignolino 12.72 22.90 58.40 1.40 1.81 0.81 3.50 2.20 18.80 890 83 86 418

GRI0672 Grignolino 12.08 23.50 56.90 1.33 1.13 0.71 3.65 2.51 24.00 980 85 78 215

GRI0772 Grignolino 13.05 25.50 104.80 1.64 3.86 0.73 3.19 2.32 22.50 938 98 85 195

GRI0173 Grignolino 11.84 23.40 70.80 1.80 0.89 1.00 3.40 2.58 18.00 922 80 94 378


GRI0273 Grignolino 12.67 24.30 74.10 1.70 0.98 0.88 3.35 2.24 18.00 840 81 99 336

GRI0373 Grignolino 12.16 25.80 78.90 1.84 1.61 0.78 3.37 2.31 22.80 845 98 90 285

GRI0473 Grignolino 11.65 22.90 62.90 1.80 1.67 0.64 3.55 2.62 26.00 1045 125 88 281

GRI0573 Grignolino 11.64 24.20 72.40 1.84 2.06 0.89 3.40 2.46 21.60 962 79 84 304

ERA0174 Barbera 12.86 26.80 87.30 0.99 1.35 0.92 3.22 2.32 18.00 830 52 122 266

ERA0274 Barbera 12.88 23.95 78.90 1.85 2.99 0.98 3.50 2.40 20.00 795 55 104 269

ERA0374 Barbera 12.81 24.45 76.20 2.93 2.31 0.87 3.64 2.40 24.00 785 49 98 266

ERA0474 Barbera 12.70 24.75 91.00 1.91 3.55 1.80 3.26 2.36 21.50 805 47 106 356

ERA0574 Barbera 12.51 23.50 104.70 1.34 1.24 0.98 3.50 2.25 17.50 975 60 85 273

ERA0674 Barbera 12.60 23.60 80.60 2.26 2.46 0.97 3.31 2.20 18.50 760 103 94 275

ERA0774 Barbera 12.25 25.30 91.40 1.42 4.72 1.25 3.40 2.54 21.00 995 105 89 262

ERA0874 Barbera 12.53 27.10 99.80 1.88 5.51 1.19 3.30 2.64 25.00 930 100 96 360

ERA0974 Barbera 13.49 25.70 115.50 2.17 3.59 1.47 3.24 2.19 19.50 825 111 88 315

ERA0176 Barbera 12.84 26.20 82.00 1.79 2.96 1.26 3.50 2.61 24.00 925 48 101 398

ERA0276 Barbera 12.93 26.78 80.00 1.69 2.81 1.15 3.31 2.70 21.00 965 40 96 351

ERA0376 Barbera 13.36 24.12 97.80 2.83 2.56 0.77 3.35 2.35 20.00 880 47 89 235

ERA0476 Barbera 13.52 27.90 85.00 1.46 3.17 1.23 3.28 2.72 23.50 880 38 97 325

ERA0576 Barbera 13.62 25.52 93.70 2.70 4.95 1.56 3.41 2.35 20.00 805 57 92 191

ERA0179 Barbera 12.25 23.40 113.50 3.54 3.88 1.04 3.01 2.20 18.50 785 77 112 358

ERA0279 Barbera 13.16 22.90 117.90 3.15 3.57 1.18 3.14 2.15 21.00 805 88 102 456

ERA0379 Barbera 13.88 21.40 99.30 2.81 5.04 1.29 3.28 2.23 20.00 750 43 80 171

ERA0479 Barbera 12.87 24.35 98.90 2.51 4.61 1.25 3.18 2.48 21.50 830 63 86 366

ERA0579 Barbera 13.32 21.46 96.90 2.85 3.24 1.75 3.30 2.38 21.50 790 42 92 306

ERA0178 Barbera 13.08 26.80 120.60 2.90 3.90 1.11 3.16 2.36 21.50 790 73 113 303

ERA0278 Barbera 13.50 26.50 105.50 2.31 3.12 1.31 3.23 2.62 24.00 980 67 123 338

ERA0378 Barbera 12.79 23.40 117.80 3.12 2.67 0.82 3.21 2.48 22.00 890 53 112 407


ERA0478 Barbera 13.11 25.20 95.40 2.26 1.90 0.86 3.49 2.75 25.50 1140 74 116 289

ERA0578 Barbera 13.23 23.85 120.60 2.80 3.30 0.80 3.20 2.28 18.50 915 68 98 351

ERA0678 Barbera 12.58 21.75 102.70 2.92 1.29 0.79 3.21 2.10 20.00 875 107 103 368

ERA0778 Barbera 13.17 23.20 129.30 2.28 5.19 1.49 3.58 2.32 22.00 1045 102 93 241

ERA0878 Barbera 13.84 24.70 122.90 2.76 4.12 1.07 3.19 2.38 19.50 840 108 89 402

ERA0978 Barbera 12.45 25.35 105.90 2.23 3.03 1.24 3.62 2.64 27.00 1050 118 97 393

ERA1078 Barbera 14.34 29.10 97.50 2.73 1.68 1.60 3.42 2.70 25.00 1095 78 98 462

ERA1178 Barbera 13.48 26.95 102.50 3.75 1.67 1.37 3.41 2.64 22.50 1055 79 89 480

mean 13.13 25.07 82.46 1.97 2.22 0.95 3.35 2.37 19.27 868.7 72.6 100.6 369.5

variance 0.59 5.25 332.34 0.35 1.27 0.07 0.02 0.08 13.24 13102.4 536.4 194.7 7453.5

standard deviation 0.77 2.29 18.23 0.59 1.12 0.26 0.14 0.28 3.64 114.5 23.2 14.0 86.3

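TABLE 2.1 (continued): variables 14 to 27, rows in the same sample order as above

(14) Chloride (mg/l); (15) Total phenols (g/l); (16) Flavanoids; (17) Non-flavanoid phenols; (18) Proanthocyanins; (19) Color intensity; (20) Hue; (21) OD280/OD315 of diluted wines; (22) OD280/OD315 of flavanoids; (23) Glycerol (g/l); (24) 2-3-butanediol (g/l); (25) Total nitrogen (mg/l); (26) Proline (mg/l); (27) Methanol (% A.A.)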

82 2.80 3.06 0.28 2.29 5.64 1.04 3.92 4.77 9.29 757 153 1065 113

90 2.65 2.76 0.26 1.28 4.38 1.05 3.40 3.80 8.93 881 194 1050 94

67 2.80 3.24 0.30 2.81 5.68 1.03 3.17 3.46 11.74 900 206 1185 125

49 3.85 3.49 0.24 2.18 7.80 0.86 3.45 3.54 10.13 1119 292 1480 80

65 2.80 2.69 0.39 1.82 4.32 1.04 2.93 3.22 10.27 799 215 735 73

58 3.27 3.39 0.34 1.97 6.75 1.05 2.85 3.16 10.85 865 364 1450 68

52 2.50 2.52 0.30 1.98 5.25 1.02 3.58 3.94 9.05 931 378 1290 80

64 2.60 2.51 0.31 1.25 5.05 1.06 3.58 3.94 10.13 865 358 1295 100

58 2.80 2.98 0.29 1.98 5.20 1.08 2.85 3.03 9.89 825 438 1045 141

64 2.98 3.15 0.22 1.85 7.22 1.01 3.55 3.75 12.65 788 350 1045 121

61 2.95 3.32 0.22 2.38 5.75 1.25 3.17 3.27 8.59 964 378 1510 123

79 2.20 2.43 0.26 1.57 5.00 1.17 2.82 3.04 11.52 894 294 1280 134

257 2.60 2.76 0.29 1.81 5.60 1.15 2.90 2.92 12.24 784 289 1320 164

50 3.10 3.69 0.43 2.81 5.40 1.25 2.73 2.82 12.29 766 224 1150 105


55 3.30 3.64 0.29 2.96 7.50 1.20 3.00 3.32 9.53 1041 324 1547 114

50 2.85 2.91 0.30 1.46 7.30 1.28 2.88 3.12 7.92 812 229 1310 97

62 2.80 3.14 0.33 1.97 6.20 1.07 2.65 3.10 9.24 836 308 1280 113

58 2.95 3.40 0.40 1.72 6.60 1.13 2.57 2.66 9.41 722 274 1130 99

28 3.30 3.93 0.32 1.86 8.70 1.23 2.82 3.17 9.85 808 230 1680 135

67 2.70 3.03 0.17 1.66 5.10 0.96 3.36 4.00 10.39 726 227 845 119

73 3.00 3.17 0.24 2.10 5.65 1.09 3.71 3.75 10.30 828 225 780 145

62 2.41 2.41 0.25 1.98 4.50 1.03 3.52 3.66 11.88 589 237 770 123

134 2.61 2.88 0.27 1.69 3.80 1.11 4.00 4.31 8.81 715 270 1035 109

73 2.48 2.37 0.26 1.46 3.93 1.09 3.63 3.82 9.14 568 248 1015 102

47 2.53 2.61 0.28 1.66 3.52 1.12 3.82 4.00 9.45 667 210 845 86

82 2.63 2.68 0.47 1.92 3.58 1.13 3.20 3.63 10.34 753 238 830 124

52 2.85 2.94 0.34 1.45 4.80 0.92 3.22 4.44 9.96 854 285 1195 85

46 2.40 2.19 0.27 1.35 3.95 1.02 2.77 3.10 9.70 757 350 1285 84

76 2.95 2.97 0.37 1.76 4.50 1.25 3.40 3.72 9.53 702 280 915 99

53 2.65 2.33 0.26 1.98 4.70 1.04 3.59 3.77 9.94 689 293 1035 100

52 1.98 0.57 0.28 0.42 1.95 1.05 1.82 2.12 5.40 736 287 520 98

108 2.05 1.09 0.63 0.41 3.27 1.25 1.67 1.42 6.90 658 345 680 127

53 2.02 1.41 0.53 0.62 5.75 0.98 1.59 1.86 8.20 691 321 450 60

47 2.10 1.79 0.32 0.73 3.80 1.23 2.46 1.73 8.60 797 262 630 87

306 3.50 3.10 0.19 1.87 4.45 1.22 2.87 3.07 7.20 748 141 420 157

116 1.89 1.75 0.45 1.03 2.95 1.45 2.23 2.73 7.50 627 219 355 58

69 2.42 2.65 0.37 2.08 4.60 1.19 2.30 2.60 7.96 680 259 678 118

148 2.98 3.18 0.26 2.28 5.30 1.12 3.18 3.33 8.20 604 100 502 114

54 2.11 2.00 0.27 1.04 4.68 1.12 3.48 4.07 7.10 554 425 510 98

111 2.53 1.30 0.55 0.42 3.17 1.02 1.93 1.92 8.10 704 363 750 137


88 1.85 1.28 0.14 2.50 2.85 1.28 3.07 3.23 6.81 714 195 718 116

50 1.10 1.02 0.37 1.46 3.05 0.91 1.82 2.00 6.38 661 301 870 78

59 2.95 2.86 0.21 1.87 3.38 1.36 3.16 3.52 7.62 748 170 410 99

43 1.88 1.84 0.27 1.03 3.74 0.98 2.78 3.50 8.04 614 160 472 64

35 3.30 2.89 0.21 1.96 3.35 1.31 3.50 3.60 8.00 731 293 985 113

48 3.38 2.14 0.13 1.65 3.21 0.99 3.13 3.15 7.70 563 183 886 57

59 1.61 1.57 0.34 1.15 3.80 1.23 2.14 2.35 6.14 596 109 428 129

122 1.95 2.03 0.24 1.46 4.60 1.19 2.48 2.85 8.40 756 167 392 145

58 1.72 1.32 0.43 0.95 2.65 0.96 2.52 3.25 5.22 514 246 500 122

99 1.90 1.85 0.35 2.76 3.40 1.06 2.31 2.70 7.96 654 259 750 59

52 2.83 2.55 0.43 1.95 2.57 1.19 3.13 3.82 8.66 700 227 463 101

27 2.42 2.26 0.30 1.43 2.50 1.38 3.12 3.52 5.80 645 199 278 95

64 2.20 2.53 0.26 1.77 3.90 1.16 3.14 3.33 7.38 664 199 714 111

53 2.00 1.58 0.40 1.40 2.20 1.31 2.72 3.50 8.11 548 203 630 115

48 1.65 1.59 0.61 1.62 4.80 0.84 2.01 2.07 8.64 649 207 515 114

95 2.20 2.21 0.22 2.35 3.05 0.79 3.08 3.81 6.36 586 138 520 141

70 2.20 1.94 0.30 1.46 2.62 1.23 3.16 3.60 7.90 600 217 450 121

54 1.78 1.69 0.43 1.56 2.45 1.33 2.26 2.92 8.04 643 195 495 116

36 1.92 1.61 0.40 1.34 2.60 1.36 3.21 3.27 9.54 608 262 562 120

70 1.95 1.69 0.48 1.35 2.80 1.00 2.75 3.60 7.97 523 223 680 120

46 1.51 1.25 0.21 0.94 4.10 0.76 1.29 1.26 6.43 673 252 630 122

72 1.30 1.22 0.24 0.83 5.40 0.74 1.42 1.34 10.10 918 319 530 102

67 1.15 1.09 0.27 0.83 5.70 0.66 1.36 1.24 10.02 1095 258 560 132

118 1.70 1.20 0.17 0.84 5.00 0.78 1.29 1.23 8.52 1020 238 600 121

29 2.00 0.58 0.60 1.25 5.45 0.75 1.51 1.40 8.32 764 178 650 79

77 1.62 0.66 0.63 0.94 7.10 0.73 1.58 1.37 6.47 573 174 695 100

144 1.38 0.47 0.53 0.80 3.85 0.75 1.27 1.12 8.25 680 217 720 107

6 1.79 0.60 0.63 1.10 5.00 0.82 1.69 1.80 8.35 821 230 515 139

56 1.62 0.48 0.58 0.88 5.70 0.81 1.82 2.23 10.40 700 245 580 150


15 2.32 0.60 0.53 0.81 4.92 0.89 2.15 2.25 10.60 940 269 590 132

25 1.54 0.50 0.53 0.75 4.60 0.77 2.31 2.34 10.62 955 260 600 82

71 1.40 0.50 0.37 0.64 5.60 0.70 2.47 2.60 10.41 814 216 780 106

21 1.55 0.52 0.50 0.55 4.35 0.89 2.06 2.21 10.20 976 201 520 118

16 2.00 0.80 0.47 1.02 4.40 0.91 2.05 2.55 8.90 899 205 550 140

14 1.38 0.78 0.29 1.14 8.21 0.65 2.00 2.23 8.16 521 218 855 97

17 1.50 0.55 0.43 1.30 4.00 0.60 1.68 2.24 5.61 696 252 830 63

10 0.98 0.34 0.40 0.68 4.90 0.58 1.33 1.81 7.94 670 156 415 154

50 1.70 0.65 0.47 0.86 7.65 0.54 1.86 2.10 8.52 806 213 625 122

21 1.93 0.76 0.45 1.25 8.42 0.55 1.62 2.19 6.12 604 219 650 106

50 1.41 1.39 0.34 1.14 9.40 0.57 1.33 1.26 7.36 733 164 550 114

106 1.40 1.57 0.22 1.25 8.60 0.59 1.30 1.29 6.28 568 129 500 107

127 1.48 1.36 0.24 1.26 10.80 0.48 1.47 1.40 7.00 898 154 480 91

55 2.20 1.28 0.26 1.56 7.10 0.61 1.33 1.25 8.57 905 249 425 125

35 1.80 0.83 0.61 1.87 10.52 0.56 1.51 1.42 10.80 915 154 675 84

100 1.48 0.58 0.53 1.40 7.60 0.58 1.55 1.34 7.52 924 142 640 100

84 1.74 0.63 0.61 1.55 7.90 0.60 1.48 1.31 9.50 969 207 725 84

6 1.80 0.83 0.48 1.56 9.01 0.57 1.64 1.92 9.29 902 159 480 132

53 1.90 0.58 0.63 1.14 7.50 0.67 1.73 2.18 10.20 865 252 880 118

49 2.80 1.31 0.53 2.70 13.00 0.57 1.96 2.25 10.82 764 223 660 182

35 2.60 1.10 0.52 2.29 11.75 0.57 1.78 2.09 11.09 1080 250 620 160

mean 66.5 2.24 1.90 0.36 1.51 5.27 0.97 2.51 2.75 8.79 759.7 240.4 779.3 110.2

variance 1977.3 0.40 0.99 0.02 0.35 4.90 0.06 0.62 0.87 2.76 20104.3 4691.8 102493.8 649.9

standard deviation 44.5 0.63 1.00 0.13 0.59 2.21 0.25 0.78 0.93 1.66 141.8 68.5 320.1 25.5


Figure 2.1 shows examples of histograms for a portion of the data given in Table 2.1, namely for variables number 13, 21, and 26.

Three typical patterns are noticeable. In particular, variable 13 (phosphate) shows a unimodal and almost symmetric shape, which may suggest that this variable follows a normal probability distribution (Fig. 2.1a).

Conversely, variable 21 (OD280/OD315 of diluted wines) presents a bimodal distribution, which may suggest that this variable is characterized by different average values for different sample classes (Fig. 2.1b). In such cases, histograms could be drawn for each class separately, to verify the trend of the within-class distributions.

FIGURE 2.1 Histograms for three variables of Table 2.1: phosphate (a), OD280/OD315 of diluted wines (b), and proline (c). Horizontal axes: value of the variable; vertical axes: relative frequency.

Instead, the histogram shape for variable 26 (proline) reveals an underlying asymmetric distribution (Fig. 2.1c). Such behavior can be converted into an almost normal one simply by applying a logarithmic transformation to the variable, as shown in Fig. 2.2.
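The effect of a logarithmic transformation on an asymmetric variable can be sketched numerically. The data below are synthetic log-normal values generated for illustration only (they are not the proline measurements), and the moment coefficient of skewness computed by the hypothetical helper `skewness` is just one possible way to quantify asymmetry.

```python
import math
import random


def skewness(values):
    """Moment coefficient of skewness (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(1)
# Synthetic, right-skewed data (log-normal); NOT the proline values.
raw = [rng.lognormvariate(6.5, 0.6) for _ in range(2000)]
logged = [math.log(x) for x in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before: {skew_raw:.2f}   after log: {skew_log:.2f}")
```

A clearly positive skewness before the transformation and a value close to zero afterwards correspond to the change in histogram shape described in the text.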

2.2.2. Normality Tests

Assessing compatibility with a normal distribution is a basic issue in data analysis, because many methods require variables to be normally distributed. As observed, frequency distributions may be employed for this purpose. Visual examination of histogram shapes may supply a preliminary evaluation. In addition, the cumulative empirical frequency distributions (EFDs) constitute the basis for a family of statistical normality tests, which are usually referred to as Kolmogorov-Smirnov tests (Kolmogorov, 1933; Smirnov, 1939).

One of the most effective and widely employed among them is the Lilliefors test, which may be used for generally assessing how well an empirical distribution fits a theoretical one (Lilliefors, 1970). In the case of normality verification, the null hypothesis ($H_0$) is that the observed empirical frequency distribution for a given variable is not significantly different from the theoretical normal probability distribution, at a given significance level. The alternative hypothesis ($H_1$) is that the observed EFD is not compatible with the theoretical normal distribution, at that significance level.

The test procedure consists in ordering the values of the variable to be tested and normalizing them by means of a Student's transformation (or autoscaling):

$$x^{*}_{i,v} = \frac{x_{i,v} - \bar{x}_v}{s_v} \qquad (2.1)$$

The variable is corrected by subtracting its mean ($\bar{x}_v$) from each of its values and then dividing by its standard deviation ($s_v$). The autoscaled variable is dimensionless and has mean equal to 0 and standard deviation equal to 1.
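A minimal sketch of the transformation in Eq. 2.1 follows. The function name `autoscale` is illustrative, and the use of the n - 1 standard deviation is an assumption; the values are pH and potassium for the first five samples of Table 2.1, chosen because their scales differ by orders of magnitude.

```python
import statistics


def autoscale(values):
    """Apply Eq. 2.1: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 in the denominator
    return [(x - mean) / sd for x in values]


# pH and potassium (mg/l) of the first five samples in Table 2.1.
ph = [3.38, 3.30, 3.48, 3.43, 3.42]
potassium = [950, 765, 936, 985, 1088]

ph_scaled = autoscale(ph)
potassium_scaled = autoscale(potassium)
for name, z in (("pH", ph_scaled), ("potassium", potassium_scaled)):
    print(name, [round(v, 2) for v in z])
```

After autoscaling, both variables are dimensionless and vary over a comparable range, regardless of their original units.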

Then, the corresponding cumulative theoretical probability distribution is estimated from the statistical parameters computed, and the maximum distance between this hypothesized distribution and the empirical one is calculated. This value is compared with a critical distance value, at a predetermined significance level, and the comparison determines the acceptance or rejection of the null hypothesis. The critical values, which depend on the sample size, were obtained by Monte Carlo simulations and are available in tables or statistical software.
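The procedure can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the published implementation: the function names `max_distance` and `critical_distance` are hypothetical, the critical value is estimated by a small Monte Carlo simulation (1000 normal samples) rather than read from the published tables, and the tested data are synthetic exponential values rather than the wine variables.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def max_distance(values):
    """Largest gap between the empirical and the fitted normal cumulative curves."""
    xs = sorted(values)
    n = len(xs)
    mean = math.fsum(xs) / n
    sd = math.sqrt(math.fsum((x - mean) ** 2 for x in xs) / (n - 1))
    d = 0.0
    for i, x in enumerate(xs):
        f = STD_NORMAL.cdf((x - mean) / sd)  # theoretical value at autoscaled x
        # The empirical curve jumps from i/n to (i+1)/n at x: check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def critical_distance(n, alpha=0.05, n_sim=1000, seed=0):
    """Monte Carlo estimate of the critical distance for samples of size n."""
    rng = random.Random(seed)
    sims = sorted(max_distance([rng.gauss(0, 1) for _ in range(n)])
                  for _ in range(n_sim))
    return sims[min(n_sim - 1, round((1 - alpha) * n_sim))]


rng = random.Random(42)
skewed = [rng.expovariate(1.0) for _ in range(200)]  # synthetic, clearly non-normal

d_obs = max_distance(skewed)
d_crit = critical_distance(len(skewed))
print(f"observed {d_obs:.3f}, critical {d_crit:.3f}, reject H0: {d_obs > d_crit}")
```

Because the mean and standard deviation are estimated from the same sample being tested, the critical distances are smaller than those of the classical Kolmogorov-Smirnov test with fully specified parameters, which is why they have to be obtained by simulation.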

The Lilliefors test can also be performed graphically (Iman, 1982), as illustrated in Figs 2.3 and 2.4 for the same cases as in Figs 2.1 and 2.2.

FIGURE 2.2 Histogram for the log-transformed variable proline of Table 2.1. Horizontal axis: log proline; vertical axis: relative frequency.

Charts for the Lilliefors test report the cumulative empirical frequency distributions (EFDs) for variables number 13, 21, and 26 of Table 2.1, after column autoscaling (red lines in Figs 2.3 and 2.4), together with the cumulative theoretical probability distribution (blue solid lines) and the distance limits according to the Lilliefors test, at a 5% significance level (blue dotted lines). When the EFD curve intersects at least one of the limits defined by the critical distance, the null hypothesis is rejected. For the examples reported in Fig. 2.3, the null hypothesis is accepted only for the variable phosphate, while for both the other variables examined it is rejected at the same significance level. In fact, only in the first case (Fig. 2.3a) does the red EFD line not intersect the critical distance lines at any point.

p0180 In addition, it can be easily verified (confirming the deductions made by looking at the histogram of Fig. 2.2) that the logarithmic transformation applied to the variable proline makes it compatible with the normal distribution (see Fig. 2.4).

s0035 2.2.3. ANOVA

p0185 Analysis of variance (ANOVA) is the name of a group of statistical methods based on Fisher's F tests, generally aimed at verifying the existence or absence of significant differences between groups of data. The null hypothesis $H_0$ is that all the data derive from the same stochastic population, i.e., there is no significant difference between the groups considered. In order to verify this hypothesis, the final F test compares the variability between groups with the variability within groups (Box et al., 1978).

f0020 FIGURE 2.3 Graphical Lilliefors normality test for three variables of Table 2.1: phosphate (a), OD280/OD315 of diluted wines (b), and proline (c). [Each panel plots the cumulative distribution (y-axis, 0 to 1) against the autoscaled variable (x-axis, -3 to 3), showing the empirical frequency distribution, the theoretical probability distribution, and the critical distances (α = 5%).]

p0190 The simplest case is the one-way ANOVA,

whose procedure is described with a real

numerical example. The two columns of Table

2.2 report the values of the alcoholic degree for

Barolo and Barbera wine samples of the red-

wine data set (Table 2.1), respectively, and

some basic descriptive parameters. A summary

of all the parameters computed for the ANOVA

test is given in Table 2.3. The aim is to assess

whether there is a signiﬁcant difference between

the alcohol content of the two wines. In fact, although the mean Barbera alcoholic degree (13.07% abv) is noticeably lower than the corresponding Barolo value (13.83% abv), the two respective ranges overlap, so that it might be suspected that the observed difference is due to chance variations.

p0195 The within-columns variance can be computed as a pooled variance, under the hypothesis that the variances of the different groups are homogeneous. When only two groups (and, consequently, two variances) are being compared, a Fisher's F test is suitable to verify this preliminary hypothesis. In the given numerical example, the test value is computed as

$$F_t = \frac{s^2_{\mathrm{Barolo}}}{s^2_{\mathrm{Barbera}}} = \frac{0.274}{0.254} = 1.08 \qquad (2.2)$$

The critical F value, at a 5% right-tailed significance level and for 29 degrees of freedom (d.o.f.) both at the numerator and at the denominator, is 1.86. Since the computed value (1.08) is lower, it is possible to conclude that the variances of the two groups considered are not significantly different at this significance level.

p0200 In problems involving more than two groups, the comparison among variances can be performed with multiple F tests on all the possible pairs, or by means of Cochran's test or Bartlett's test (Snedecor and Cochran, 1989). The former is valid when there is an equal number of data in each group, while the latter has a wider applicability.

f0025 FIGURE 2.4 Graphical Lilliefors normality test for the Log-transformed variable proline of Table 2.1. [Plot: cumulative distribution (y-axis, 0 to 1) against autoscaled log proline (x-axis, -3 to 3), showing the empirical frequency distribution, the theoretical probability distribution, and the critical distances (α = 5%).]

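p0205 In the numerical example discussed, the within-columns variance (computed as pooled variance) corresponds to

$$s^2_{\mathrm{within}} = \frac{\sum_{c=1}^{C}\sum_{n=1}^{N_c}\left(x_{nc} - \bar{x}_c\right)^2}{N - C} = \frac{15.314}{58} = 0.264 \qquad (2.3)$$

where $C$ is the number of columns (groups), $N_c$ is the number of values in column $c$, $\bar{x}_c$ is the mean of column $c$, and $N$ is the total number of values.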

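p0210 Conversely, the between-columns variance is computed as

$$s^2_{\mathrm{between}} = \frac{\sum_{c=1}^{C} N_c\left(\bar{x}_c - \bar{x}\right)^2}{C - 1} = \frac{8.786}{1} = 8.786 \qquad (2.4)$$

where $\bar{x}$ is the mean of all the values.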

p0215 Finally, the ANOVA F test value is computed as the ratio of the between-columns variance to the within-columns variance. The final test compares these two variations in order to verify the ANOVA null hypothesis:

$$F_{\mathrm{ANOVA}} = \frac{s^2_{\mathrm{between}}}{s^2_{\mathrm{within}}} = \frac{8.786}{0.264} = 33.28 \qquad (2.5)$$

p0220 The critical F value, at a 5% right-tailed significance level, for 1 degree of freedom at the numerator and 58 degrees of freedom at the denominator, is 4.01. Since the computed test value exceeds it, the null hypothesis is rejected at a 5% significance level. The conclusion is that the difference between the alcoholic content of Barolo and Barbera samples is significantly larger than the variability within each of the two groups.

t0015 TABLE 2.2 Alcohol Content (% abv) for Barolo and Barbera Samples of Data Set Red Wines, and Basic Statistical Parameters

| | Barolo | Barbera |
|---|---|---|
| | 14.23 | 12.86 |
| | 13.20 | 12.88 |
| | 13.16 | 12.81 |
| | 14.37 | 12.70 |
| | 13.24 | 12.51 |
| | 14.20 | 12.60 |
| | 14.39 | 12.25 |
| | 14.06 | 12.53 |
| | 14.83 | 13.49 |
| | 13.86 | 12.84 |
| | 14.10 | 12.93 |
| | 14.12 | 13.36 |
| | 13.75 | 13.52 |
| | 14.75 | 13.62 |
| | 14.38 | 12.25 |
| | 13.63 | 13.16 |
| | 14.30 | 13.88 |
| | 13.83 | 12.87 |
| | 14.19 | 13.32 |
| | 13.64 | 13.08 |
| | 14.06 | 13.50 |
| | 12.93 | 12.79 |
| | 13.71 | 13.11 |
| | 12.85 | 13.23 |
| | 13.50 | 12.58 |
| | 13.05 | 13.17 |
| | 13.39 | 13.84 |
| | 13.30 | 12.45 |
| | 13.87 | 14.34 |
| | 14.02 | 13.48 |
| Min | 12.85 | 12.25 |
| Max | 14.83 | 14.34 |
| Mean | 13.83 | 13.07 |
| Variance | 0.274 | 0.254 |
| Standard deviation | 0.524 | 0.504 |

t0020 TABLE 2.3 Full ANOVA Parameters for the Data given in Table 2.2. Computed F ratio (from variances of columns Barolo and Barbera) = 1.08. Critical F value (at 5% significance) = 1.86. F test on variances of columns Barolo and Barbera: significance = 41.8%

| Source of variation | d.o.f. | Sum of squares | Variance |
|---|---|---|---|
| Total | 60 | 10874.485 | |
| Mean | 1 | 10850.384 | |
| Between columns | 1 | 8.786 | 8.786 |
| Within columns | 58 | 15.314 | 0.264 |

Computed F ratio = 33.28; Critical F value (at 5% significance) = 4.01; ANOVA F test: significance = 0.0%.
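The one-way ANOVA computations of Eqns (2.2) to (2.5) can be reproduced with a short script. The sketch below assumes NumPy and uses the alcohol contents listed in Table 2.2; the variable names are illustrative, and the critical F value (4.01) is taken from the text rather than computed.

```python
import numpy as np

# Alcohol content (% abv) from Table 2.2
barolo = np.array([14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 14.06,
                   14.83, 13.86, 14.10, 14.12, 13.75, 14.75, 14.38, 13.63,
                   14.30, 13.83, 14.19, 13.64, 14.06, 12.93, 13.71, 12.85,
                   13.50, 13.05, 13.39, 13.30, 13.87, 14.02])
barbera = np.array([12.86, 12.88, 12.81, 12.70, 12.51, 12.60, 12.25, 12.53,
                    13.49, 12.84, 12.93, 13.36, 13.52, 13.62, 12.25, 13.16,
                    13.88, 12.87, 13.32, 13.08, 13.50, 12.79, 13.11, 13.23,
                    12.58, 13.17, 13.84, 12.45, 14.34, 13.48])

groups = [barolo, barbera]
n_total = sum(g.size for g in groups)    # N
n_groups = len(groups)                   # C
grand_mean = np.concatenate(groups).mean()

# Preliminary F test on the two variances, Eqn (2.2)
f_variances = barolo.var(ddof=1) / barbera.var(ddof=1)

# Within-columns (pooled) variance, Eqn (2.3)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
var_within = ss_within / (n_total - n_groups)

# Between-columns variance, Eqn (2.4)
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
var_between = ss_between / (n_groups - 1)

# ANOVA F ratio, Eqn (2.5), compared with the critical value quoted in the text
f_anova = var_between / var_within
f_critical = 4.01
print(f"F (variances) = {f_variances:.2f}")
print(f"F (ANOVA) = {f_anova:.2f}, H0 rejected: {f_anova > f_critical}")
```

With more than two groups, only the `groups` list needs to be extended; the preliminary check on the homogeneity of variances would then require one of the multi-group tests mentioned above.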

p0225 ANOVA tests can be applied also when the

effect of two variability sources (e.g., type of

wine and vintage year) is to be veriﬁed: such

a scheme is usually called a two-way ANOVA.

When a number of replicate measurements are

available for each level combination of the two

factors (nested two-way ANOVA), the model

obtained also allows an estimation of the inter-

action between the factors, together with its

signiﬁcance.

s0040 2.2.4. Radar Charts

p0230 Radar charts (also known as web charts, spider charts, star charts, cobweb charts, polar charts, star plots, or Kiviat diagrams) are a data display tool that can be considered as a sort of link between univariate and multivariate graphical representations (Chambers et al., 1983).

p0235 They consist of circular graphs divided into a number of equiangular spokes, called radii. Each radius represents one of the variables. A point is marked on it, whose distance from the center is proportional to the magnitude of the related variable for that datum. Finally, all the data points (corresponding to all the variables measured on a sample) are connected with a line, which represents a sort of sample profile.
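The geometry of such a chart is simple enough to be sketched directly. The following Python fragment, assuming NumPy, computes the Cartesian coordinates of the profile polygon for one sample; the function name `radar_vertices` is hypothetical, and the values are assumed to be non-negative (autoscaled data would first need an offset, which is a choice left to the user).

```python
import numpy as np

def radar_vertices(values):
    """Return x, y coordinates of the closed profile polygon of one sample.

    Variable k is placed on the spoke at angle 2*pi*k/V, at a distance
    from the center equal to its (non-negative) value.
    """
    values = np.asarray(values, dtype=float)
    n_vars = values.size
    angles = 2.0 * np.pi * np.arange(n_vars) / n_vars   # equiangular spokes
    x = values * np.cos(angles)
    y = values * np.sin(angles)
    # Repeat the first vertex so that the connecting line is closed
    return np.append(x, x[0]), np.append(y, y[0])

x, y = radar_vertices([1.0, 2.0, 3.0, 4.0])
print(np.round(x, 3), np.round(y, 3))
```

The returned coordinates can be passed to any plotting routine that draws a polyline.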

p0240 Usually, each plot represents a single sample,

and multiple observations are compared by

examining different plots. It is also possible to

overdraw several lines on the same chart,

although the outcome will be legible only for

small data sets. As a matter of fact, when the

number of samples is large, such graphical

representation is generally not very functional.

p0245 Within radar charts, variables can be represented without any previous scaling, revealing which variables are dominant for a given data set. Nonetheless, when variables are characterized by considerably different scales (as is the case for the red-wine data of Table 2.1), a preliminary transformation may be helpful in order to make the contribution of all of them visible within the graph, by assuring the same a priori importance.

p0250 For instance, by looking at Fig. 2.5, it clearly

appears that, without any scaling, four features

are dominating, corresponding to the variables

number 10, 13, 24, and 16, which are character-

ized by the highest mean values (see Table

2.2). The contribution of the remaining 23 vari-

ables is not recognizable within these graphs.

Furthermore, it is not possible to draw many

valuable considerations about the sample

proﬁles. In particular, it can be noticed that

Grignolino wines (Fig. 2.5a) are characterized,

on average, by smaller values for the four

observable variables. It can also be deduced

that Barolo has a higher contribution from vari-

able 26, while Barbera (Fig. 2.5c) has higher

contributions from variables 24 and 10.

p0255 On the other hand, Fig. 2.6 illustrates that,

after application of column autoscaling (see

Eqn (2.1)), the a priori differences in location

and dispersion among the original variables

are eliminated, thus showing the contribution

of all of them and highlighting the differences

among the observations. In fact, in this second

graph, the proﬁles of the three wines appear

much more dissimilar than in the previous

one. By a joint examination of the three radar

charts of Fig. 2.6, it can be deduced that Barolo

and Barbera samples present two rather

complementary proﬁles, while the Grignolino

proﬁle is somewhat intermediate. In particular,

Barolo (Fig. 2.6a) is characterized by higher

average values of variables 1, 2, 13, 15, 16, 18,

21, 22, 23, and 26. Instead, Grignolino (Fig. 2.6b) presents lower average values of all the variables, except for number 20. Finally, Barbera (Fig. 2.6c) has a larger contribution from variables 3, 4, 5, 9, 17, 19, and 24.

p0260 Such deductions may be useful for character-

ization purposes.

s0045 2.3. MULTIVARIATE DATA ANALYSIS

s0050 2.3.1. Principal-Component Analysis

p0265 Principal-component analysis (PCA), which

originates in the work of K. Pearson (1901), is

one of the basic and most useful tools in the

branch of multivariate analysis. It is an explor-

atory method, which always offers an overview

of the problem studied and often allows the

drawing of signiﬁcant conclusions and for deci-

sions to be made on the basis of the observed

results. Furthermore, PCA can be employed

for feature and noise reduction purposes and

constitutes the basis for other more complex

pattern recognition techniques.

p0270 PCA is based on the assumption that a high variability (i.e., a high variance value) is synonymous with a high amount of information.

f0030 FIGURE 2.5 Radar charts of average profiles of Barolo (a), Grignolino (b), and Barbera (c) samples. Numbers from 1 to 27 correspond to the original variables listed in Table 2.1.

f0035 FIGURE 2.6 Radar charts of average profiles of Barolo (a), Grignolino (b), and Barbera (c) samples. Numbers from 1 to 27 correspond to the variables listed in Table 2.1, reprocessed by application of column autoscaling.

p0275 For this reason, PCA algorithms search for

the maximum variance direction, in the multidi-

mensional space of the original data, preferably

passing through the data centroid, which means

that data have to be at least mean centered

column-wise. The maximum variance direction

represents the ﬁrst principal component (PC).

The second PC is the direction which keeps

the maximum variance among all directions

orthogonal to the ﬁrst PC. It follows that the

second PC explains the maximum information

not explained by the ﬁrst one or, in other words,

these two new variables are not inter-correlated.

The process continues with the identiﬁcation of

the subsequent PCs: it may stop at reaching

a variance cutoff value or continue until all the

variability enclosed in the original data has

been explained (Jolliffe, 2002).

p0280 Since the variance values depend on the

scale of the variables, it becomes difﬁcult to

compare and impossible to combine informa-

tion from variables of different nature, unless

properly normalized: column autoscaling (see

Eqn (2.1)) is the most commonly applied

transform.

p0285 Each sample can be projected in the space

deﬁned by the new variables: the coordinate

values obtained are called scores.

p0290 The PCs are expressible as linear combinations of the original variables: the coefficients which multiply each variable are called loadings. They represent the cosine values (direction cosines) of the angles between the PCs and the original variables. These values may vary between -1 and +1, indicating the importance of each variable in defining a given PC: the larger the absolute value of the cosine, the closer the two directions, and thus the larger the contribution of the original variable to the PC.

p0295 In terms of matrix algebra, the rotation from the space of the original variables to the PC space is performed by means of the orthogonal loading matrix, $\mathbf{L}$:

$$\mathbf{S}_{N,V} = \mathbf{X}_{N,V}\,\mathbf{L}_{V,V} \qquad (2.6)$$

where $\mathbf{S}$ is the score matrix and $\mathbf{X}$ is the original matrix, constituted by $N$ objects (rows) described by $V$ variables (columns).
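Equation (2.6) can be illustrated with a minimal sketch, assuming NumPy. Here the loadings are obtained from the singular value decomposition of the autoscaled data, which is one common way of computing PCA and not necessarily the algorithm used for the results discussed in this chapter; the function name `pca_autoscaled` and the simulated data are illustrative assumptions.

```python
import numpy as np

def pca_autoscaled(X):
    """PCA on column-autoscaled data; returns scores, loadings, explained variance."""
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # column autoscaling
    # Rows of Vt are the maximum-variance directions, in decreasing order
    _, singular_values, Vt = np.linalg.svd(Xs, full_matrices=False)
    loadings = Vt.T                                      # matrix L
    scores = Xs @ loadings                               # S = X L, Eqn (2.6)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, loadings, explained

# Simulated data (not the red-wine data set): 50 objects, 5 variables
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
X[:, 1] += 2.0 * X[:, 0]          # introduce correlation between two variables
scores, loadings, explained = pca_autoscaled(X)
print("explained variance (%):", np.round(100 * explained, 1))
```

The first two columns of `scores` and `loadings` are what a score plot, a loading plot, or a biplot would display, and the first two entries of `explained` give the fraction of information shown in such a plot.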

p0300 One of the key features of PCA is its high

capability for representing large amounts of

complex information by way of simple bidimen-

sional or tridimensional plots.

p0305 In fact, the space described by two or three

PCs can be used to represent the objects (score

plot), the original variables (loading plot), or

both objects and variables (biplot) (Geladi

et al., 2003; Kjeldahl and Bro, 2010). Since prin-

cipal components are not inter-correlated vari-

ables, no duplicate information is shown in PC

plots.

p0310 Figure 2.7 represents an example of a highly informative biplot, which derives from PCA performed on the red-wine data set given in Table 2.1. Data have been previously autoscaled, in order to eliminate the magnitude differences among the variables. The two Cartesian axes correspond to the first (meaning low-order) two PCs, which together account for 44.2% of the information (defined as explained variance) enclosed in the original multidimensional data space.

p0315 The plot clearly shows the interrelations

existing among samples, among variables, and

between samples and variables. Moreover,

considering also the additional row information

given in Table 2.1 (namely, the class of each

sample, graphically represented by different

colors), it is possible to get information about

the discrimination among the three wine cate-

gories (Barolo, Grignolino, and Barbera), the

dispersion of samples within each class, and

the discriminatory importance of the variables

measured.

p0320 In particular, it appears that PC1 (which accounts for 27.4% of the total variance, i.e., information) is a direction effective in distinguishing among the three wine classes, especially between Barolo (green scores) and Barbera (blue scores) samples. Instead, PC2 (explaining 16.8% of the total variance) is useful in differentiating mainly Grignolino samples (red scores) from the other two groups.

p0325 The variables which present the highest absolute loading values on PC1 are numbers 15, 16, 21, 22, 3, 4, 5, 6, 9, and 17. This means that such variables are the most important in defining PC1 and, consequently, in discriminating among the three wine classes, particularly Barolo from Barbera. In more detail, looking at the correspondences between scores and loadings, it can be deduced from this plot that the samples that present on average the highest values for variables 3, 4, 5, 6, 9, and 17 and the lowest values for variables 15, 16, 21, and 22 belong to class Barbera. Exactly opposite considerations apply to the samples of class Barolo (low values for the variables of the first group and high values for the variables of the second group). Instead, Grignolino wines lie in a halfway position, meaning that they are characterized by intermediate values for all these variables. Conversely, variable number 20 has the highest loading value on PC2 and it is clearly in the same direction as the Grignolino cluster. This means that Grignolino wines have, on average, high values of variable 20. Opposite considerations are valid for variables 1, 2, 8, 10, 19, 23, and 24.

p0330 All of these inferences are in accord with the deductions already made by inspection of the radar charts in Fig. 2.6. Moreover, the information that variable 21 (OD280/OD315 of diluted wines) has a high discriminant power is in agreement with the bimodal distribution observed in the histogram of Fig. 2.1b for the same variable.

f0040 FIGURE 2.7 Example of PCA biplot for the autoscaled red-wine data. The scores (colored bottles) correspond to the wine samples of classes Barolo (green), Grignolino (red), and Barbera (blue), respectively. The loadings (line segments and numbers from 1 to 27) represent the contribution of the original variables, as listed in Table 2.1, to the information visualized in the plot.

p0335 Variables 7, 12, 25, and 27 have very small

loading values on both the PCs, meaning that

such variables give a negligible contribution to

the portion of information visualized in this

plot.

p0340 Further considerations may be drawn by

looking at the distribution of samples inside

each class. For instance, it can be easily seen

that Barolo wines are characterized by the

lowest within-class variability. In the case of

quality control, a low sample variability about

a target value is an index of high quality (Tagu-

chi, 1986).

p0345 It is worth noticing that a single and simple

biplot is able to report a considerable amount

of information that would require a large

number of univariate plots and tests to be

extracted: PCA is, without any doubt, the most

efﬁcient way to account for the information

enclosed in a data table.

s0055 2.3.2. Signal Pre-Processing

p0350 Instrumental analytical techniques commonly provide information in the form of digital signals. Spectra, chromatograms, and voltammograms are typical examples. Such signals generally need to be suitably pretreated, since the analytical information is not their only component. A number of different variations, from sources other than the analytical system under investigation, generally affect signals. They may be related either to the electric components of the instrumentation or to the surroundings. In particular, unwanted signal variations may be random or systematic. The former are generally due to sporadic interferences or associated with random phenomena (e.g., Brownian motion of particles and thermal motion of electrons, the so-called Johnson-Nyquist noise), which usually follow a standard normal or a Poisson probability distribution. This type of noise, also called white noise, is characterized by frequency values higher than those of the useful signal.

p0355 Instead, systematic unwanted variations are

commonly due to instrumental trends or to

external inﬂuences. They may affect the signal

with baseline shifts and/or drifts, which can

be considered as a low-frequency contribution.

p0360 Signal processing is generally aimed at minimizing the unwanted variations, thus improving the quality of signals and, consequently, the conversion of data to valuable information.

p0365 In particular, it is possible to identify three main objectives: reduction of random noise, reduction of systematic unwanted variations, and reduction of data size.

p0370 Several pre-processing techniques accomplish more than one of these objectives. Furthermore, in some cases, the transformation itself facilitates the interpretation of complex signals, as is the case for derivatives.

p0375 When several digital signals are structured into a data matrix, each of them corresponding to a row (following the chemometric convention), signal pre-processing is also known as row pre-processing. The mathematical transforms act on each single signal, independently of the others.

p0380 Techniques for reduction of random noise include the moving average (or boxcar) filters, the Savitzky-Golay smoothing (Savitzky and Golay, 1964), and the Fourier transform (FT)-based filters (Reis et al., 2009).
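As an illustration of the simplest of these techniques, the sketch below implements a moving average (boxcar) filter with NumPy. The function name, the default window width, and the choice of discarding the points at the two ends of the signal are assumptions made for this example.

```python
import numpy as np

def moving_average(signal, window=5):
    """Boxcar smoothing: each point is replaced by the mean of its window.

    Only points with a complete window are returned, so the output is
    shorter than the input by (window - 1) points.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    signal = np.asarray(signal, dtype=float)
    kernel = np.full(window, 1.0 / window)      # equal weights
    return np.convolve(signal, kernel, mode="valid")

# Simulated noisy signal: a smooth band plus white noise
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 301)
noisy = np.exp(-x ** 2) + 0.05 * rng.standard_normal(x.size)
smoothed = moving_average(noisy, window=9)
print(noisy.size, smoothed.size)
```

A wider window removes more noise but also broadens and lowers narrow peaks, so the width is usually chosen as a compromise.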

p0385 Regarding the elimination or minimization of unwanted systematic effects, a number of mathematical methods for signal transformation are widely employed, such as the standard normal variate (SNV) transform and derivatives.

s0060 2.3.2.1. Standard Normal Variate (SNV) Transform

p0390 The SNV transform, or row autoscaling, is particularly applied in spectroscopy, since it is useful to correct for both baseline shifts and global intensity variations (Barnes et al., 1989). Each signal ($x_i$) is row-centered, by subtracting its mean ($\bar{x}_i$) from each single value ($x_{i,v}$), and then scaled by dividing by the signal standard deviation ($s_i$):

$$x^{*}_{i,v} = \frac{x_{i,v} - \bar{x}_i}{s_i} \qquad (2.7)$$

p0395 After transformation, each signal presents

mean equal to 0 and standard deviation equal

to 1.

p0400 SNV has the peculiarity of possibly shifting

informative regions along the signal range, so

that the interpretation of the results referring

to the original signals should be performed

with caution (Fearn, 2009).
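Equation (2.7) translates into a few lines of code. The sketch below, assuming NumPy and signals stored as rows of a matrix, also shows on simulated values that a signal affected by a baseline shift and by a global intensity change becomes identical to the original one after the transform; the function name `snv` is illustrative.

```python
import numpy as np

def snv(X):
    """Standard normal variate transform: autoscaling of each row (signal)."""
    X = np.asarray(X, dtype=float)
    row_mean = X.mean(axis=1, keepdims=True)
    row_std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - row_mean) / row_std

# Simulated signal and a copy with a global intensity factor and a baseline shift
base = np.array([0.1, 0.4, 0.9, 0.3, 0.2])
signals = np.vstack([base, 2.0 * base + 0.5])
transformed = snv(signals)
print(np.allclose(transformed[0], transformed[1]))
```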

s0065 2.3.2.2. Derivatives

p0405 The numerical differentiation of digitized

signals may correct for baseline shifts and drifts,

depending on the derivation order. Further-

more, derivative proﬁles often exhibit an

increased apparent resolution of overlapping

peaks and may accentuate small structural

differences between nearly identical signals

(Taavitsainen, 2009).

p0410 The first derivative of a signal $y = f(x)$ is the rate of change of $y$ with $x$ (i.e., $y' = dy/dx$), which can be interpreted, at each single point, as the slope of the line tangent to the signal. It provides a correction for baseline shifts.

p0415 The second derivative can be considered as a further derivation of the first derivative ($y'' = d^2y/dx^2$); it represents a measure of the curvature of the original signal, i.e., the rate of change of its slope. Such a transform provides a correction for both baseline shifts and drifts.

p0420 A disadvantageous consequence of derivation may be an enhancement of the random noise, characterized by high-frequency slope variations. To overcome this hurdle, signals are first smoothed, often by using the Savitzky-Golay algorithm (Savitzky and Golay, 1964) with a third-order polynomial.
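Smoothing and differentiation can be combined in a single step, since the Savitzky-Golay approach fits a polynomial to a moving window and evaluates the polynomial, or one of its derivatives, at the window center. The following is a minimal sketch of this idea, assuming NumPy and equally spaced points; the function names, the default window half-width, and the choice of dropping the incomplete windows at the two ends are assumptions of this example.

```python
import math
import numpy as np

def savgol_coefficients(half_width, poly_order=3, deriv=0):
    """Weights giving the deriv-th derivative of the fitted polynomial at the window center."""
    offsets = np.arange(-half_width, half_width + 1)
    # Design matrix with columns 1, k, k^2, ... for window offsets k
    A = np.vander(offsets, poly_order + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse yields the fitted coefficient a_deriv;
    # the derivative of the polynomial at k = 0 is deriv! * a_deriv
    return math.factorial(deriv) * np.linalg.pinv(A)[deriv]

def savgol_valid(signal, half_width=3, poly_order=3, deriv=0, step=1.0):
    """Smoothed signal (deriv=0) or its derivative, for points with a full window."""
    signal = np.asarray(signal, dtype=float)
    weights = savgol_coefficients(half_width, poly_order, deriv)
    # np.convolve flips its kernel, so the weights are reversed beforehand
    out = np.convolve(signal, weights[::-1], mode="valid")
    return out / step ** deriv

# Simulated signal with a baseline shift and a linear drift added
x = np.arange(0.0, 5.0, 0.1)
y = np.exp(-(x - 2.5) ** 2)
y_affected = y + 0.3 + 0.2 * x
first = savgol_valid(y_affected, deriv=1, step=0.1)    # shift removed
second = savgol_valid(y_affected, deriv=2, step=0.1)   # shift and drift removed
print(np.allclose(second, savgol_valid(y, deriv=2, step=0.1)))
```

In this example, the first derivative of the affected signal differs from that of the original one only by a constant (the slope of the drift), whereas the second derivatives coincide.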

s0070 2.3.2.3. Horizontal Alignment

p0425 When a series of chromatograms is used as vectors to build a data matrix, a typical problem arises from the horizontal shifts that commonly characterize this type of data.

p0430 The most common methods for peak alignment, such as correlation optimized warping (COW), search for the maximum correlation between a selected reference profile and a series of piecewise modified (shifted and warped) versions of the signals to be aligned (Nielsen et al., 1998; Jellema, 2009).

s0075 2.3.3. Supervised Data Analysis and Validation

p0435 Exploratory techniques for data analysis,

such as PCA, are unsupervised, meaning that

they just show the data as they are. Conversely,

supervised chemometric methods look for

determined features within data, explicitly

oriented to address particular issues.

p0440 In particular, when a model is developed with the purpose of predicting a qualitative or quantitative property of interest, its reliability in prediction should be assessed prior to using the model in practice. Prediction ability values should be presented together with their confidence interval (Forina et al., 2001; Forina et al., 2007), which depends on the number of samples used for the validation. The estimation of the predictive ability on new samples (not used for building the models) is a fundamental step in any modeling process and several procedures have been developed for this purpose. The most common validation strategies divide the available samples into two subsets: a training (or calibration) set used for calculating the model and an evaluation set used for assessing its reliability. A key feature for an honest validation is that the test samples have to be absolutely extraneous to the model: no information from them can be used in building the model, not even in the pre-processing stages; otherwise, the prediction ability may be overestimated.

p0445 In many modeling techniques, some parame-

ters are optimized looking for a setting that

provides the maximum predictive ability for

the model for a given sample subset.

p0450 In such cases, a correct validation strategy

would involve three sample subsets: a training

set, an optimization set, and an evaluation set.

The optimization set is used to ﬁnd the best

modeling settings, while the actual reliability

of the ﬁnal model is estimated by way of a real

prediction on the third subset, formed by objects

that have never inﬂuenced the model.

p0455 The evaluation of the predictive ability of a model can be performed in a single step or many times with different evaluation sets, depending on the strategy adopted.

s0080 2.3.3.1. Single Evaluation Set

p0460 A single evaluation set is the simplest and most rapid validation scheme. A fraction of the available samples (usually between 50% and 90% of the total number) constitutes the training set, while the remaining objects form the evaluation set. The subdivision may be arbitrary, random, or performed by way of a uniform design, such as the Kennard and Stone and the duplex algorithms (Kennard and Stone, 1969; Snee, 1977), which allow two subsets that are uniformly distributed and representative of the total sample variability to be obtained.

s0085 2.3.3.2. Cross-Validation (CV)

p0465 Cross-validation is probably the most common validation procedure. The $N$ available samples are divided into $G$ cancellation groups following a predetermined scheme (e.g., contiguous blocks or Venetian blinds). The model is computed $G$ times: each time, one of the cancellation groups is used as the evaluation set, while the other groups constitute the training set. At the end of the procedure, each sample has been used $G - 1$ times for building a model and once for evaluation. The number of cancellation groups usually ranges from 3 to $N$. Cross-validation with $N$ cancellation groups is generally known as the leave-one-out procedure (LOO). LOO has the advantage of being unique for a given data set, whereas, when $G < N$, different orders of the samples and different subdivision schemes generally supply different outcomes. However, especially when the total number of samples is considerable, predictions made on a single object, although repeated many times, may yield an overly optimistic result. An extensive evaluation strategy consists in performing cross-validation many times, with different numbers of cancellation groups, from 3 up to $N$. Another possibility is to repeat the validation, for a given number $G < N$ of cancellation groups, each time permuting the order of the samples, thus obtaining a different group composition each time.
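The two subdivision schemes mentioned above can be sketched as follows, assuming NumPy. The function names and the scheme labels are hypothetical; in the Venetian blinds scheme every G-th sample is assigned to the same group, while in the contiguous blocks scheme consecutive samples are grouped together.

```python
import numpy as np

def cancellation_groups(n_samples, n_groups, scheme="venetian"):
    """Assign each sample (in its current order) to one of the cancellation groups."""
    index = np.arange(n_samples)
    if scheme == "venetian":
        return index % n_groups                  # 0, 1, ..., G-1, 0, 1, ...
    if scheme == "contiguous":
        return index * n_groups // n_samples     # G consecutive blocks
    raise ValueError("scheme must be 'venetian' or 'contiguous'")

def cross_validation_splits(n_samples, n_groups, scheme="venetian"):
    """Yield (training indices, evaluation indices) for each of the G models."""
    labels = cancellation_groups(n_samples, n_groups, scheme)
    for group in range(n_groups):
        yield np.flatnonzero(labels != group), np.flatnonzero(labels == group)

print(cancellation_groups(6, 3, "venetian"))
print(cancellation_groups(6, 3, "contiguous"))
# Leave-one-out corresponds to as many cancellation groups as samples
for train, evaluation in cross_validation_splits(4, 4):
    print(train, evaluation)
```

Permuting the sample order before calling `cancellation_groups` gives the repeated cross-validation with different group compositions described in the text.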

s0090 2.3.3.3. Repeated Evaluation Set

p0470 This procedure, also called Monte Carlo validation, computes many models (often many

thousands), each time creating a different evalu-

ation set, with a variable number of samples, by

random selection. Each sample may fall many

times, or even no times at all, in the evaluation

set. The main drawback of this validation

strategy is the longer computational time.
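A possible way of generating the random evaluation sets is sketched below, assuming NumPy. The function name, the number of repetitions, and the range of evaluation-set fractions are arbitrary choices made for illustration only.

```python
import numpy as np

def monte_carlo_splits(n_samples, n_repeats=1000,
                       eval_fraction_range=(0.1, 0.5), seed=0):
    """Yield random (training indices, evaluation indices) pairs.

    The size of the evaluation set varies from one repetition to the next.
    """
    rng = np.random.default_rng(seed)
    low, high = eval_fraction_range
    for _ in range(n_repeats):
        fraction = rng.uniform(low, high)
        n_eval = max(1, int(round(fraction * n_samples)))
        permuted = rng.permutation(n_samples)
        yield np.sort(permuted[n_eval:]), np.sort(permuted[:n_eval])

# Count how many times each sample falls in the evaluation set
counts = np.zeros(20, dtype=int)
for train, evaluation in monte_carlo_splits(20, n_repeats=200):
    counts[evaluation] += 1
print(counts)
```

The counts show that, unlike in cross-validation, the samples are not all evaluated the same number of times.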

s0095 2.3.4. Supervised Qualitative Modeling

s0100 2.3.4.1. Classification and Class-Modeling

p0475 A large number of issues within food science require qualitative answers. This is the case for the characterization of ingredients or finished products, the verification of the geographical origin and, more generally, quality control, the control of food adulterations, and so on.

p0480 Discriminant classification and class-modeling techniques represent the most common chemometric tools for addressing such aims. In fact, they build mathematical rules or models able to characterize a sample with respect to a qualitative property, namely the class to which it belongs.

p0485 A class (or category) is deﬁned as a group of

samples having in common the same values of

discrete variables or proximate values of contin-

uous variables. Frequently, such variables are

qualitative factors that cannot be determined

experimentally, so that they have to be esti-

mated from the values of some experimentally

measurable predictors, by way of suitable math-

ematical tools.

p0490 In more detail, discriminant classiﬁcation

techniques are able to determine to which class

a sample more probably belongs, among

a number of predeﬁned classes. They work by

building a delimiter between the classes and,

then, each new object is always assigned to the

category to which it more probably belongs,

even in the case of objects which are not perti-

nent to any class studied.

p0495 Instead, class-modeling techniques verify

whether a sample is compatible or not with the

characteristics of a given class of interest. In fact,

they provide an answer to the general question:

“Is sample X, claimed to belong to class A, actu-

ally compatible with the class A model?”. This is

essentially the question to be answered in most

of the real qualitative problems studied within

the food sciences. Such an approach is also

capable of detecting outliers (Forina et al., 2008).

s0105 2.3.4.2. Evaluation Parameters

p0500 The effectiveness of a classification rule is usually evaluated by the classification rate, i.e., the percentage of objects correctly classified. This parameter is often indicated as prediction rate, when it is estimated by means of an evaluation sample set. A class rule can be considered valuable when the prediction rate is significantly higher than the null-classification rate, which is defined as the percentage of correct assignments expected by chance, and corresponds to 100% divided by the number of categories.

p0505 A class model is characterized by two parameters: sensitivity and specificity. Sensitivity is defined as the percentage of objects belonging to the modeled class which are rightly accepted by the model. Specificity is the percentage of objects not belonging to the modeled class which are rightly rejected by the model. A class-modeling technique builds a class space, whose width corresponds to the confidence interval, at a pre-selected confidence level, for the class objects: sensitivity is an experimental measure of this confidence level. A decrease in the confidence level for the modeled class generally reduces the sensitivity and increases the specificity of the model. Frequently, in order to evaluate the model performance taking both features into account, an efficiency parameter is computed as the geometric mean of sensitivity and specificity.
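The three parameters can be computed directly from the accept/reject decisions of a class model. The following is a minimal sketch; the function name and the twenty decisions are invented for illustration.

```python
import math


def class_model_performance(accepted, in_class):
    """Return (sensitivity, specificity, efficiency) in percent.

    accepted[i] -- True if the class model accepts object i
    in_class[i] -- True if object i really belongs to the modeled class
    """
    n_in = sum(in_class)
    n_out = len(in_class) - n_in
    rightly_accepted = sum(a and c for a, c in zip(accepted, in_class))
    rightly_rejected = sum((not a) and (not c) for a, c in zip(accepted, in_class))
    sensitivity = 100.0 * rightly_accepted / n_in
    specificity = 100.0 * rightly_rejected / n_out
    efficiency = math.sqrt(sensitivity * specificity)  # geometric mean
    return sensitivity, specificity, efficiency


# Hypothetical case: 10 objects of the modeled class, 10 objects of other classes.
in_class = [True] * 10 + [False] * 10
accepted = [True] * 9 + [False] + [True] * 2 + [False] * 8

sens, spec, eff = class_model_performance(accepted, in_class)
print(f"sensitivity {sens:.1f}%, specificity {spec:.1f}%, efficiency {eff:.1f}%")
```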

p0510 When at least two classes are modeled, the results of the class-modeling analysis can be visualized by means of Coomans plots (Coomans et al., 1984). Such graphs represent the samples in relation to their distances from the models of two given classes. Often, the distances are normalized by dividing them by the critical distance value that characterizes the corresponding model.

p0515 In the example given in Fig. 2.8, the two Cartesian axes correspond to the distances from the model of class Barolo and from the model of class Grignolino (red wine data set), respectively, while two straight lines parallel to the axes mark the limits of the corresponding class spaces at a 95% confidence level. The plot area is divided into four regions, which contain, respectively: the samples accepted by the model of class Barolo (upper left rectangle), the samples accepted by the model of class Grignolino (lower right rectangle), the samples accepted by both models (lower left square), and the objects rejected by both models (upper right square). All the samples belonging to the class Barbera correctly lie inside this last region.
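The four regions of a Coomans plot follow from comparing each normalized distance with 1. A small sketch, assuming generic class models A and B with hypothetical distances and critical values:

```python
def coomans_region(dist_a, dist_b, crit_a, crit_b):
    """Locate a sample in a Coomans plot from its distances to two class models.

    Distances are normalized by the critical distance of the corresponding
    model, so that a normalized distance <= 1 means "inside the class space".
    """
    in_a = dist_a / crit_a <= 1.0
    in_b = dist_b / crit_b <= 1.0
    if in_a and in_b:
        return "accepted by both models"
    if in_a:
        return "accepted by model A only"
    if in_b:
        return "accepted by model B only"
    return "rejected by both models"


# Hypothetical distances; critical distances 1.0 (model A) and 1.5 (model B).
print(coomans_region(0.5, 3.0, 1.0, 1.5))
print(coomans_region(4.0, 4.0, 1.0, 1.5))
```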


p0520 Classiﬁcation and class-modeling techniques

belong to three main families:

u0010 • distance-based techniques

u0015 • probabilistic techniques

u0020 • experience-based techniques.

s0110 2.3.4.3. Distance-Based Techniques

s0115 2.3.4.3.1. k-NEAREST NEIGHBORS (k-NN)

p0540 k-NN is one of the simplest approaches for

classiﬁcation (Vandeginste et al., 1998). As the

ﬁrst step, k-NN computes the distances of the

test sample from each of the samples of

a training set, whose class membership is

known. Usually, the multivariate Euclidean

distance is employed:

$$D_{i,j} = \sqrt{\sum_{v=1}^{V} \left(x_{i,v} - x_{j,v}\right)^{2}} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)'(\mathbf{x}_i - \mathbf{x}_j)} \qquad (2.8)$$

with $\mathbf{x}_i$ and $\mathbf{x}_j$ being the data vectors of two samples, $x_{i,v}$ the value of variable $v$ for sample $i$, and $V$ the number of variables.

f0045 FIGURE 2.8 Example of Coomans' plot for the red wine data set given in Table 2.1. The samples are represented by class-colored symbols: green (Barolo), red (Grignolino), and blue (Barbera).

p0545 In some cases, the Mahalanobis distance can be used. It can be considered as a Euclidean distance modified to take into account the dispersion and the correlation of all of the variables:

$$D_{i,j} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)'\,\mathbf{V}^{-1}(\mathbf{x}_i - \mathbf{x}_j)} \qquad (2.9)$$

where $\mathbf{V}$ is the covariance matrix.

p0550 Once the matrix of distances between objects has been computed, the k samples nearest to the test sample are taken into consideration to perform the classification: generally, a majority vote is employed, meaning that the new object is assigned to the class most represented among the k selected objects. Being a distance-based method, k-NN is sensitive to the


measurement units and to the scaling proce-

dures applied.
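The k-NN rule described above (Euclidean distances to all training samples, then a majority vote among the k nearest) can be sketched as follows. The training points and labels are artificial, and the function names are illustrative only.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Multivariate Euclidean distance between two sample vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(train_x, train_y, test_x, k=3):
    """Assign test_x to the class most represented among its k nearest neighbors."""
    ranked = sorted((euclidean(x, test_x), label) for x, label in zip(train_x, train_y))
    votes = Counter(label for _, label in ranked[:k])
    # On a tied vote, the class encountered first (nearest neighbor) wins.
    return votes.most_common(1)[0][0]


# Artificial two-variable training set with known class membership.
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_classify(train_x, train_y, (2.0, 2.0), k=3))
print(knn_classify(train_x, train_y, (8.0, 9.0), k=3))
```

Because the result depends on the measurement units, in practice the variables would normally be scaled (for example autoscaled) before the distances are computed; that step is omitted here for brevity.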

p0555 The method provides a nonlinear delimiter between categories, generally expressible as a piecewise linear function (see Fig. 2.9). The delimiter usually becomes smoother for higher values of k. When the parameter k is optimized to obtain the highest prediction ability for a given data set, validation should be performed by way of a three-set procedure.

p0560 Being a nonprobabilistic method, k-NN is free from statistical assumptions (such as normality of variable distributions) and from restrictions on the number of variables. This ensures wide applicability. Furthermore, in many applications, it has been shown to perform as well as or better than more complex methods (Vandeginste et al., 1998; Dudoit et al., 2002).

s0120 2.3.4.3.2. A NONPARAMETRIC CLASS-MODELING TECHNIQUE

p0565 Derde et al. (1986) presented a simple and efficient nonparametric class-modeling technique, closely related to k-NN, which defines the class space on the basis of a critical distance from the objects of the training set (see Fig. 2.10). Several settings can be varied, such as the type of distance (Euclidean or Mahalanobis) and the strategy adopted to determine the critical distance value. Unfortunately, this promising class-modeling technique has fallen out of use, and it merits a thorough reconsideration.

s0125 2.3.4.3.3. SOFT INDEPENDENT MODELING OF CLASS ANALOGY (SIMCA)

p0570 Soft independent modeling of class analogy (SIMCA) (Wold and Sjöström, 1977) was the

f0050 FIGURE 2.9 Example of k-NN class delimiter for k = 1 (artificial data; axes X1 and X2).

first class-modeling technique introduced into chemometrics. This method builds class models based on PCA performed using only the samples of the category studied, generally after within-class autoscaling or centering. In more detail, SIMCA models are defined by the range of the sample scores on a selected number of low-order principal components (PCs), ideally the significant ones. The models therefore correspond to rectangles (two PCs), parallelepipeds (three PCs), or hyper-parallelepipeds (more than three PCs), referred to as the multidimensional boxes of the SIMCA inner space. Conversely, the principal components not used to describe the model define the outer space, which represents the space of uninformative variations, often due to noise. The score range can be enlarged or reduced, mainly depending on the number of samples, to avoid under- or overestimation of the true variability (Forina and Lanteri, 1984). The standard deviation of the distance of the objects in the training set from the model corresponds to the class standard deviation. The boundaries of the SIMCA space around the model are determined (as shown in Fig. 2.11) by a critical distance, which is obtained by means of Fisher's F statistic. SIMCA is a very flexible technique, since it allows variation of a large number of parameters, such as the scaling or weighting of the original variables, the number of components, and the expansion or contraction of the score range.

s0130 2.3.4.4. Probabilistic Techniques

s0135 2.3.4.4.1. LINEAR DISCRIMINANT ANALYSIS (LDA)

p0575 Linear discriminant analysis (LDA) was the first multivariate classification technique,

f0055 FIGURE 2.10 Example of nonparametric class space (shadowed region enclosed by the red line) for the class of interest (artificial data; axes X1 and X2).

introduced by Fisher (1936). It is a probabilistic

parametric technique, i.e. it is based on the

estimation of multivariate probability density

functions, which are entirely described by

a minimum number of parameters: means, vari-

ances, and covariances. LDA is based on the

hypotheses that the probability density distribu-

tions are multivariate normal and that the

dispersion is the same for all the categories.

This means that the variance-covariance matrix

is the same for all of the categories, while the

centroids are different (different location). In

the case of two variables, the probability density

function is bell-shaped and its elliptic section

lines correspond to equal probability density

values and to the same Mahalanobis distance

from the centroid.

p0580 Because of the above-mentioned LDA hypotheses, the ellipses of different categories have equal eccentricity and axis orientation: they differ only in their location in the plane. By connecting the intersection points of each pair of corresponding ellipses, a straight line is identified which corresponds to the delimiter between the two classes (see Fig. 2.12). For this reason, the technique is called linear discriminant analysis. The directions which maximize the separation between pairs of classes are the so-called canonical variables.
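The LDA hypotheses (one centroid per class, a single pooled variance-covariance matrix) lead to a linear score for each class; a sample is assigned to the class with the highest score. The sketch below uses NumPy and artificial data. The function names, and the choice of estimating prior probabilities from the class frequencies, are assumptions made for illustration.

```python
import numpy as np


def fit_lda(X, y):
    """Estimate class centroids, pooled covariance (inverted) and priors."""
    classes = np.unique(y)
    n, v = X.shape
    means = {c: X[y == c].mean(axis=0) for c in classes}
    pooled = np.zeros((v, v))
    for c in classes:
        centered = X[y == c] - means[c]
        pooled += centered.T @ centered          # within-class scatter
    pooled /= n - len(classes)                   # same dispersion for all classes
    priors = {c: np.mean(y == c) for c in classes}
    return classes, means, np.linalg.inv(pooled), priors


def predict_lda(model, x):
    """Assign x to the class with the largest linear discriminant score."""
    classes, means, inv_cov, priors = model
    scores = []
    for c in classes:
        m = means[c]
        scores.append(x @ inv_cov @ m - 0.5 * m @ inv_cov @ m + np.log(priors[c]))
    return classes[int(np.argmax(scores))]


# Artificial data: two classes with the same dispersion and different centroids.
X = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0],
              [6.0, 7.0], [7.0, 6.0], [7.0, 8.0], [8.0, 7.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = fit_lda(X, y)
print(predict_lda(model, np.array([2.5, 2.5])), predict_lda(model, np.array([6.0, 6.5])))
```

Replacing the pooled matrix by one covariance matrix per class (and adding the corresponding log-determinant term to each score) would turn this sketch into a quadratic rule.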

s0140 2.3.4.4.2. QUADRATIC DISCRIMINANT ANALYSIS (QDA)

p0585 Quadratic discriminant analysis (QDA) is a probabilistic parametric classification technique which represents an evolution of LDA for nonlinear class separations. Like LDA, QDA is based on the hypothesis that the

f0060 FIGURE 2.11 SIMCA normal-range model (red segment) and class space (shadowed region enclosed by the red line) for the class of interest (artificial data; axes X1 and X2).

probability density distributions are multivariate normal but, in this case, the dispersion is not the same for all of the categories. It follows that the categories differ not only in the position of their centroid but also in their variance-covariance matrix (different location and dispersion). Consequently, the ellipses of different categories also differ in eccentricity and axis orientation (Geisser, 1964). By connecting the intersection points of each pair of corresponding ellipses (at the same Mahalanobis distance from the respective centroids), a quadratic delimiter is identified, which is a conic section (for instance, a parabola) in the two-dimensional case, as represented in Fig. 2.13.

s0145 2.3.4.4.3. UNEQUAL CLASS MODELS (UNEQ)

p0590 UNEQ is a powerful class-modeling technique, which originated in the work of Hotelling (1947) and was introduced into chemometrics by Derde and Massart (1986). The method, closely related to QDA, is based on the hypothesis of a multivariate normal distribution in each category studied and on the use of Hotelling's $T^2$ statistic to define a class space, whose boundary is an ellipse (two variables), an ellipsoid (three variables), or a hyper-ellipsoid (more than three variables). The extent of the class space is defined by the critical value of the $T^2$ statistic at a selected confidence level (see Fig. 2.14). The eccentricity and the orientation of the ellipse depend on the correlation between the variables and on their dispersion.

p0595 These probabilistic techniques impose some restrictions on the number of objects required. From a strictly mathematical point of view, the number of objects must be at least one more than the number of variables measured. Nevertheless, in order to obtain reliable results, these techniques should be applied only when the ratio between the number of objects in

f0065 FIGURE 2.12 Iso-probability ellipses under LDA hypotheses and the resultant linear class delimiter (artificial data; axes X1 and X2).

a given category and the number of variables

is at least three. Furthermore, the number of

objects in each class should be nearly

balanced: it is not advisable to work when

ratios between number of objects in different

categories are greater than three (Derde and

Massart, 1989).
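These rules of thumb are easy to check before a probabilistic model is fitted. The helper below is a hypothetical sketch: its name, its default thresholds (both set to three, as in the text), and the class sizes are illustrative.

```python
def check_sample_size(class_counts, n_variables, min_ratio=3.0, max_imbalance=3.0):
    """Return a list of warnings about class sizes for LDA, QDA or UNEQ."""
    problems = []
    for name, n_objects in class_counts.items():
        if n_objects < n_variables + 1:
            # Strict mathematical requirement: at least V + 1 objects.
            problems.append(f"{name}: fewer than V + 1 objects, model cannot be computed")
        elif n_objects / n_variables < min_ratio:
            problems.append(f"{name}: objects/variables ratio below {min_ratio:g}")
    largest = max(class_counts.values())
    smallest = min(class_counts.values())
    if largest / smallest > max_imbalance:
        problems.append(f"class sizes differ by a factor above {max_imbalance:g}")
    return problems


# Hypothetical data set: two classes, five variables.
for warning in check_sample_size({"A": 30, "B": 8}, n_variables=5):
    print(warning)
```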

p0600 In cases involving many variables, it is

possible to apply LDA and QDA-UNEQ

following a preliminary reduction in the vari-

able number, for instance, by a PCA-based

compression.

s0150 2.3.4.4.4. POTENTIAL FUNCTIONS METHODS

p0605 Potential function techniques were intro-

duced into chemometrics by Coomans and

Broeckaert (1986). These methods estimate

a probability density distribution as a sum of

contributions of each single sample in a training

set. A variety of functions can be used to deﬁne

the individual contributions. The most

commonly used are Gaussian-like functions,

with a smoothing coefﬁcient that is formally

analogous to the standard deviation of the

Gaussian probability function and thus determines the shape of the distribution. This coefficient can be the same for all the samples of a given class (fixed potential), or it can be varied as a function of the local density of samples: the latter strategy, known as normal variable potential, is useful when the underlying multivariate distribution is very asymmetric, with regions characterized by nonuniform density of samples (Forina et al., 1991).

p0610 The value of the smoothing coefficient can be optimized by means of a leave-one-out procedure with an optimization sample set.
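A fixed-potential estimate with Gaussian-like contributions, and a leave-one-out choice of the smoothing coefficient, might look like the sketch below. The optimization criterion used here (the leave-one-out log-likelihood) is an assumption made for illustration, since the text does not specify one; the training points, candidate values and function names are likewise invented.

```python
import math


def gaussian_potential(x, center, smoothing):
    """Contribution of one training sample: an isotropic Gaussian centered on it."""
    v = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    norm = (2.0 * math.pi * smoothing ** 2) ** (-v / 2.0)
    return norm * math.exp(-sq_dist / (2.0 * smoothing ** 2))


def potential_density(x, training, smoothing):
    """Estimated probability density at x: average of the individual contributions."""
    return sum(gaussian_potential(x, t, smoothing) for t in training) / len(training)


def loo_log_likelihood(training, smoothing):
    """Sum of log-densities of each sample, estimated without that sample."""
    total = 0.0
    for i, x in enumerate(training):
        others = training[:i] + training[i + 1:]
        total += math.log(potential_density(x, others, smoothing))
    return total


# Artificial two-variable class with a nonuniform distribution of samples.
training = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.6), (1.2, 0.7), (4.0, 4.2), (4.5, 3.8)]
candidates = [0.2, 0.5, 1.0, 2.0]
best = max(candidates, key=lambda s: loo_log_likelihood(training, s))
print("selected smoothing coefficient:", best)
```

A variable potential would replace the single `smoothing` value by one value per training sample, chosen according to the local density of samples.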

f0070 FIGURE 2.13 Iso-probability ellipses under QDA hypotheses and the resultant quadratic class delimiter (artificial data; axes X1 and X2).

p0615 As represented in Fig. 2.15, the resulting esti-

mated overall probability distribution can be

very complex, capable of effectively describing

nonuniform distributions of samples. From the

probability distribution, the boundary of the

class space can be obtained at a selected conﬁ-

dence level.

s0155 2.3.5. Supervised Quantitative Modeling

p0620 Regression deﬁnes mathematical relationships

between variables or groups of variables, and

provides models for quantitative predictions.

p0625 Regression techniques can be univariate or multivariate, depending on the number of predictors and, possibly, of response variables involved. Furthermore, they can be linear or nonlinear, depending on the type of relationship they are able to model.

p0630 Univariate linear regression is a very common

tool in analytical chemistry, generally used to

describe the relation between a chemical quan-

tity (typically, the concentration of an analyte in

a series of standards) and a measured physical

variable (e.g., absorbance values at a given wave-

length). The mathematical model obtained is

used inversely, to compute the chemical quantity

in real samples from the values of the physical

measures performed on them.
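The calibration and its inverse use can be illustrated with a straight line fitted by least squares. The concentrations and absorbances below are invented, exactly linear values, and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Least-squares intercept and slope of the straight line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented standards: concentration (arbitrary units) and measured absorbance.
conc = [0.0, 2.0, 4.0, 6.0, 8.0]
absorbance = [0.01, 0.21, 0.41, 0.61, 0.81]

intercept, slope = fit_line(conc, absorbance)

# Inverse use of the model: concentration of a real sample from its absorbance.
sample_absorbance = 0.35
sample_conc = (sample_absorbance - intercept) / slope
print(f"estimated concentration: {sample_conc:.2f}")
```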

s0160 2.3.5.1. Ordinary Least Squares (OLS)

p0635 The ordinary least-squares (OLS), or classical least-squares (CLS), method is probably the most widely used and the most extensively studied historically. It looks for the combination of parameters of the linear model (intercept and slope) that provides the minimum value for the sum of the squared residuals (i.e., the squared differences between the values estimated by the model and the

f0075 FIGURE 2.14 UNEQ model (red cross) and class space (shadowed region enclosed by the red line) for the class of interest (artificial data; axes X1 and X2).

corresponding true values). The statistical assumptions underlying the method make it possible to calculate the confidence interval for each predicted value.

p0640 OLS can also be applied to multivariate data, namely when there are two or more predictors. In such cases, the method is also known as multiple linear regression (MLR) (Draper and Smith, 1981).

p0645 The model can be expressed as a mathematical relationship between the response $y$ and the $V$ predictors:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_V x_V \qquad (2.10)$$

that is, in matrix notation:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} \qquad (2.11)$$

where $\mathbf{X}$ is the matrix of the predictors (one row per object) augmented with a column of ones, necessary for the estimation of the intercept, and $\mathbf{b}$ is the column vector of the regression coefficients. The regression coefficients are estimated by

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \qquad (2.12)$$

The elements of the vector $\mathbf{y}$ are the reference values of the response variable, used for building the model.
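Equations 2.10 to 2.12 translate almost literally into NumPy. The data below are artificial and noise-free, built from known coefficients so that the estimate can be checked; the variable names are illustrative.

```python
import numpy as np

# Artificial predictors (5 objects, 2 variables) and a noise-free response
# generated from known coefficients: y = 1.0 + 2.0 * x1 - 0.5 * x2.
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X_raw[:, 0] - 0.5 * X_raw[:, 1]

# Augment the predictor matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Solve the normal equations (X'X) b = X'y, which is equivalent to Eq. 2.12.
b = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ b  # Eq. 2.11
print("coefficients:", np.round(b, 6))
```

Solving the linear system is used here instead of forming the inverse explicitly; the two are algebraically equivalent, and solving is generally the numerically safer choice.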

p0650 The uncertainty on the coefficient estimation varies inversely with the determinant of the information matrix $(\mathbf{X}'\mathbf{X})$ which, in the case of a single predictor, corresponds to its variance. In the multivariate case, the determinant value depends on the variance of the predictors and on their inter-correlation: a high correlation gives a small determinant of the information matrix, which means a large uncertainty on the coefficients and, consequently, unreliable regression results.
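A worked special case may make the role of the correlation explicit. Assume, for simplicity only, two mean-centered predictors (so that the intercept column can be left out), with sums of squares $S_{11}$ and $S_{22}$, cross-product sum $S_{12}$, and correlation coefficient $r$. Then:

$$\det(\mathbf{X}'\mathbf{X}) = \det\begin{pmatrix} S_{11} & S_{12} \\ S_{12} & S_{22} \end{pmatrix} = S_{11}S_{22} - S_{12}^{2} = S_{11}S_{22}\,(1 - r^{2}), \qquad r = \frac{S_{12}}{\sqrt{S_{11}S_{22}}}$$

The determinant is largest when the two predictors are uncorrelated ($r = 0$) and tends to zero as $|r|$ approaches 1, whatever the individual variances of the predictors.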

p0655 This is the typical situation when vectors corresponding to almost continuous signals (e.g., spectra) are used as predictors. In fact, in such cases, contiguous variables are considerably inter-correlated. In a spectrum, for instance,

f0080 FIGURE 2.15 Potential functions class space (shadowed region enclosed by the red line) for the class of interest (artificial data; axes X1 and X2).

absorbances evaluated at two consecutive wavelengths frequently carry almost the same information, so that their correlation coefficient is nearly 1. In such cases, standard OLS is not recommended.

p0660 Furthermore, the number of objects required for OLS regression must be at least equal to the number of predictors plus 1. This condition is often not satisfied in practice.

s0165 2.3.5.2. Principal-Component Regression (PCR)

p0665 Principal-component analysis offers a very

simple approach to overcome these hurdles.

The model is obtained by a classical least-

squares approach, which uses a reduced

number of signiﬁcant principal components,

computed from the original variables, as the

predictors (Jolliffe, 1982). The PCs are, by deﬁni-

tion, orthogonal and, therefore, uncorrelated.

p0670 This technique, which is very efficient in many cases, is known as principal-component regression (PCR). Since the directions which explain the largest amount of variance (i.e., the lowest-order PCs) are not always the most important for predicting a response variable, it is possible to follow a refined approach, which performs a stepwise selection of the principal components to be used in the model on the basis of their modeling efficiency.
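A basic PCR, without the stepwise selection of components, can be sketched with NumPy: the PCs are obtained from the centered predictors, and the response is regressed on the scores of the first few of them. The function name and the artificial data are assumptions made for illustration.

```python
import numpy as np


def fit_pcr(X, y, n_components):
    """Regress y on the scores of the first n_components principal components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are the PC directions, ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T               # variables x components
    scores = Xc @ loadings                       # objects x components (orthogonal)
    q = np.linalg.solve(scores.T @ scores, scores.T @ (y - y_mean))
    coef = loadings @ q                          # back to the original variables
    intercept = y_mean - x_mean @ coef
    return intercept, coef


# Artificial noise-free data: y = 1.0 + 2.0 * x1 - 0.5 * x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

intercept_full, coef_full = fit_pcr(X, y, n_components=2)  # all PCs: same as OLS
intercept_one, coef_one = fit_pcr(X, y, n_components=1)    # compressed model
print(np.round(coef_full, 6), np.round(coef_one, 3))
```

With all the components retained the model coincides with OLS; the benefit of PCR appears when only the significant components are kept.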

s0170 2.3.5.3. Partial Least Squares (PLS)

p0675 PLS (partial least squares, or projections onto latent structures) is probably the most widely used multivariate regression technique (Wold et al., 2001) and represents a better solution to both of the problems of variable number and inter-correlation. The latent structures,

more frequently called latent variables (LVs) or

PLS components, are directions in the space of

the predictors. In particular, the ﬁrst latent vari-

able is the direction characterized by the

maximum covariance with the selected

response variable. The information related to

the ﬁrst latent variable is then subtracted from

both the original predictors and the response.

The second latent variable is orthogonal to the

ﬁrst one, being the direction of maximum

covariance between the residuals of the predic-

tors and the residuals of the response. This

approach continues for the subsequent LVs.
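The sequence described above (direction of maximum covariance with the response, subtraction of the explained information, repetition on the residuals) corresponds to the usual single-response NIPALS scheme. The following NumPy sketch is one common formulation, given here under that assumption; the function name and the artificial data are illustrative.

```python
import numpy as np


def fit_pls1(X, y, n_components):
    """Single-response PLS by successive deflation of predictors and response."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean              # residuals, initially centered data
    W, P, T, q = [], [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                            # direction of maximum covariance
        w /= np.linalg.norm(w)
        t = Xr @ w                               # scores on this latent variable
        tt = t @ t
        p = Xr.T @ t / tt                        # predictor loadings
        q_a = (yr @ t) / tt                      # response loading
        Xr = Xr - np.outer(t, p)                 # subtract the explained information
        yr = yr - q_a * t
        W.append(w); P.append(p); T.append(t); q.append(q_a)
    W, P, T, q = np.array(W).T, np.array(P).T, np.array(T).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)       # coefficients for the original variables
    intercept = y_mean - x_mean @ coef
    return intercept, coef, T


# Artificial noise-free data: y = 1.0 + 2.0 * x1 - 0.5 * x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

intercept, coef, scores = fit_pls1(X, y, n_components=2)
print(np.round(coef, 6), round(float(scores[:, 0] @ scores[:, 1]), 8))
```

The near-zero product of the two score vectors reflects the orthogonality of successive latent variables mentioned in the text.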

p0680 The optimal complexity of the PLS model, i.e., the most appropriate number of latent variables, is determined by evaluating, with a proper validation strategy, the prediction error corresponding to models of increasing complexity. The parameters considered are usually the standard deviation of the error of calibration (SDEC), computed with the objects used for building the model, and the standard deviation of the error of prediction (SDEP), computed with objects not used for building the model:

$$\mathrm{SDEC(P)} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}} \qquad (2.13)$$

where $y_i$ is the value of the response variable $y$ for sample $i$, $\hat{y}_i$ is the corresponding value computed or predicted by the model, and $N$ is the number of samples.
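Equation 2.13 applies identically to SDEC and SDEP; only the set of objects changes. A short sketch with invented reference and predicted values:

```python
import math


def sd_error(y_reference, y_model):
    """Standard deviation of the error (Eq. 2.13).

    Gives SDEC when the objects were used to build the model and SDEP when
    they were not.
    """
    n = len(y_reference)
    squared = sum((yr - ym) ** 2 for yr, ym in zip(y_reference, y_model))
    return math.sqrt(squared / n)


# Invented reference values and model predictions for four external objects.
y_reference = [1.0, 2.0, 3.0, 4.0]
y_predicted = [1.1, 1.9, 3.2, 3.8]

sdep = sd_error(y_reference, y_predicted)
print(f"SDEP = {sdep:.4f}")
```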

p0685 In general, the calibration error always decreases as the number of LVs increases, because the fit improves (toward overfitting). On the contrary, the prediction error generally decreases up to a certain model complexity and then rises: this indicates that the LVs introduced beyond that point are bringing in noise, as shown in the example given in Fig. 2.16. A simple and practical criterion is to choose, as the optimal model complexity, the number of LVs corresponding to the absolute minimum of SDEP or, better, to its first local minimum. In the example of Fig. 2.16, this corresponds to six LVs.
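The two selection criteria can differ on the same SDEP curve, as the sketch below shows. The SDEP values are invented for illustration and are not read from Fig. 2.16; the function names are hypothetical.

```python
def first_local_minimum(sdep):
    """Number of LVs (1-based) at which SDEP first stops decreasing."""
    for i in range(len(sdep) - 1):
        if sdep[i + 1] > sdep[i]:
            return i + 1
    return len(sdep)  # the curve never rises: keep the most complex model tested


def absolute_minimum(sdep):
    """Number of LVs (1-based) giving the lowest SDEP overall."""
    return sdep.index(min(sdep)) + 1


# Invented SDEP values for models with 1 to 10 latent variables.
sdep_curve = [3.2, 2.1, 1.5, 1.1, 0.9, 0.7, 0.8, 0.75, 0.65, 0.9]

n_first = first_local_minimum(sdep_curve)
n_abs = absolute_minimum(sdep_curve)
print(f"first local minimum: {n_first} LVs; absolute minimum: {n_abs} LVs")
```

Preferring the first local minimum favors the simpler model when a later, slightly lower SDEP value may be due to chance.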

p0690 When the number of noisy (noninformative)

variables is too large, the performance of PLS

models may be improved by a selection of

useful predictors performed in advance (Forina

et al., 2007).


p0695 A number of PLS variants have been developed, for instance, for building nonlinear models and for predicting two or more response variables simultaneously (PLS-2). Furthermore, when category indices are taken as dummy response variables, PLS may work as a classification method, which is usually called PLS discriminant analysis (PLS-DA) or discriminant PLS (D-PLS).

s0175 2.3.6. Artiﬁcial Neural Networks

p0700 Artiﬁcial neural networks (ANNs) are

a family of nonparametric versatile tools that

can be employed both for data exploration and

for qualitative and quantitative predictive

modeling. ANNs offer some advantages. For

instance, they are generally well suited for

nonlinear problems, and the related software

is easily available. Conversely, a number of

important drawbacks should limit ANN use

only to the cases in which other simpler tech-

niques fail and, primarily, when a large number

of samples are available.

p0705 Multilayer feedforward neural networks (MLF) represent the configuration of ANNs most widely applied to electronic tongue data. An illustrative scheme is shown in Fig. 2.17.

p0710 MLF networks are composed of a number of computational elements, called neurons, generally organized in three layers (Zupan, 1994). In the first one, the input layer, there are usually N neurons

f0090 FIGURE 2.17 Illustrative scheme of general inter-neuronal connections and transmission/correction mechanisms for a multilayer feedforward neural network (input layer I1, I2, ...; hidden layer H1, ..., Hj; output layer O1, O2; weights w_{n,j} and w_{j,o}).

f0085 FIGURE 2.16 Example of typical trends of the calibration error (SDEC) and the prediction error (SDEP) with increasing PLS model complexity (x-axis: number of latent variables; y-axis: SD).

which correspond to the original predictors. The

predictors are scaled (generally range scaled).

When their number is very large, the principal

components are often used, in order to reduce

the data amount and the computational time.

p0715 The first layer transmits the value of the predictors to the second (hidden) layer. All the neurons of the input layer are connected to the J neurons of the second layer by means of weight coefficients, meaning that the J elements of the hidden layer receive, as information, a weighted sum S of the values from the input layer. They transform the information received (S) by means of a suitable transfer function, frequently a sigmoid.

p0720 These neurons transmit information to the third (output) layer, as a weighted combination (Z) of their values. The neurons in the output layer correspond to the response variables which, in the case of classification, are the coded class indices. The output neurons transform the information Z, from the hidden layer, by means of a further sigmoid or semilinear function.
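The transmission of information through the three layers can be written as two weighted sums, each followed by a transfer function. The sketch below uses a sigmoid in both layers and omits bias terms, which the text does not mention; the weights and input values are arbitrary illustrative numbers.

```python
import math


def sigmoid(s):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-s))


def forward(inputs, w_hidden, w_output):
    """Propagate range-scaled predictors through a three-layer feedforward net.

    w_hidden[j][n] -- weight between input neuron n and hidden neuron j
    w_output[o][j] -- weight between hidden neuron j and output neuron o
    """
    # Each hidden neuron receives a weighted sum S of the inputs and transforms it.
    hidden = [sigmoid(sum(w * x for w, x in zip(weights, inputs))) for weights in w_hidden]
    # Each output neuron receives a weighted combination Z of the hidden values.
    outputs = [sigmoid(sum(w * h for w, h in zip(weights, hidden))) for weights in w_output]
    return hidden, outputs


# Arbitrary example: 3 input neurons, 2 hidden neurons, 1 output neuron.
inputs = [0.2, 0.7, 0.1]
w_hidden = [[0.5, -0.3, 0.8], [-0.6, 0.9, 0.1]]
w_output = [[1.2, -0.7]]

hidden, outputs = forward(inputs, w_hidden, w_output)
print([round(h, 3) for h in hidden], [round(o, 3) for o in outputs])
```

Training would then adjust `w_hidden` and `w_output` over repeated cycles according to the prediction error, a step not shown here.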

p0725 After a first random initialization of their values, a learning procedure modifies the weights $w_{n,j}$ and $w_{j,o}$ during several optimization cycles, in order to improve the performance of the net. The correction of the weights at each step is proportional to the prediction error of the previous cycle. The optimization of many parameters and the high number of learning cycles considerably increase the risk of overfitting and, for this reason, a thorough validation is required, with a substantial number of objects.

p0730 Other widely employed types of ANNs are Kohonen's self-organizing maps (SOMs), used for unsupervised exploratory analysis, and the counterpropagation (CP) neural networks, used for nonlinear regression and classification (Kohonen, 2001). These tools also require a considerable number of objects to build reliable models, and a rigorous validation.

s0180 Acknowledgment

p0735The authors wish to thank Mr. Patrick Guerin for the careful

revision of the manuscript.

References

American Oil Chemists' Society, 1998. Official Methods and Recommended Practices of the American Oil Chemists' Society, fifth ed. AOCS, Champaign, IL.
Barnes, R.J., Dhanoa, M.S., Lister, S.J., 1989. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Appl. Spectrosc. 43 (5), 772–777.
Box, G.E.P., Hunter, W.G., Hunter, J.S., 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York.
Chambers, J.M., Cleveland, W.S., Kleiner, B., Tukey, P.A., 1983. Graphical Methods for Data Analysis. Chapman & Hall, New York.
Coomans, D., Broeckaert, I., Derde, M.P., Tassin, A., Massart, D.L., Wold, S., 1984. Use of a microcomputer for the definition of multivariate confidence regions in medical diagnosis based on clinical laboratory profiles. Comput. Biomed. Res. 17, 1–14.
Coomans, D., Broeckaert, I., 1986. Potential Pattern Recognition in Chemical and Medical Decision Making. Research Studies Press, Letchworth, England.
Derde, M.P., Kaufman, L., Massart, D.L., 1986. A non-parametric class modelling technique. J. Chemom. 3, 375–395.
Derde, M.P., Massart, D.L., 1986. UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution. Anal. Chim. Acta 184, 33–51.
Derde, M.P., Massart, D.L., 1989. Evaluation of the required sample size in some supervised pattern recognition techniques. Anal. Chim. Acta 223, 19–44.
Dudoit, S., Fridlyand, J., Speed, T.P., 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97, 77–87.
Draper, N.R., Smith, H., 1981. Applied Regression Analysis, second ed. Wiley, New York.
Fearn, T., 2009. The effect of spectral pre-treatments on interpretation. NIR News 20, 16–17.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188.
Forina, M., Lanteri, S., 1984. Chemometrics: mathematics and statistics in chemistry. In: Kowalski, B.R. (Ed.), NATO ASI Series, Ser. C, vol. 138. Reidel Publ. Co., Dordrecht, pp. 439–466.
Forina, M., Armanino, C., Castino, M., Ubigli, M., 1986. Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201.


Forina, M., Armanino, C., Leardi, R., Drava, G., 1991. A

class-modelling technique based on potential functions.

J. Chemom. 5, 435e453.

Forina, M., Lanteri, S., Rosso, S., 2001. Conﬁdence intervals of

the prediction ability and performance scores of classiﬁ-

cationsmethods. Chemometr. Intell. Lab. Syst. 57,121e132.

Forina, M., Lanteri, S., Casale, M., 2007. Multivariate cali-

bration. J. Chromatogr. A 1158, 61e93.

Forina, M., Oliveri, P., Lanteri, S., Casale, M., 2008. Class-

modeling techniques, classic and new, for old and new

problems. Chemometr. Intell. Lab. Syst. 93, 132e148.

Geisser, S., 1964. Posterior odds for multivariate normal

distributions. J. R. Stat. Soc. Series B Stat Methodological

26, 69e76.

Geladi, P., Manley, M., Lestander, T., 2003. Scatter plotting in

multivariate data analysis. J. Chemom. 17, 503e511.

Hotelling, H., 1947. Multivariate Quality Control. In:

Eisenhart, C., Hastay, M.W., Wallis, W.A. (Eds.), Tech-

niques of Statistical Analysis. McGraw-Hill, New York,

pp. 111e184.

Iman, R.L., 1982. Graphs for use with the Lilliefors test for

normal and exponential distributions. Amer. Statist. 36,

109e112.

Jellema, R.H., 2009. Variable shift and alignment. In:

Brown, S.D., Tauler, R., Walczak, B. (Eds.), Comprehensive

Chemometrics, vol. 2. Elsevier, Amsterdam, pp. 85e108.

Jolliffe, I.T., 1982. Anote on the use of principal components in

regression. J. R.Stat. Soc. Ser. C (Appl. Stat.) 31 (3), 300e303.

Jolliffe, I.T., 2002. Principal Component Analysis, second ed.

Springer, New York. 201e207.

Kennard, R.W., Stone, L.A., 1969. Computer aided design of

experiments. Technometrics 11, 137e148.

Kjeldahl, K., Bro, R., 2010. Some common misunderstand-

ings in chemometrics. J. Chemom. 24, 558e564.

Kolmogorov, A., 1933. Sulla determinazione empirica di una

legge di distribuzione. G. Inst. Ital. Attuari 4, 83e91.

Kohonen, T., 2001. Self Organizing Maps, third ed. Springer,

New York, NY.

Lilliefors, H.W., 1970. On the KolmogoroveSmirnov test for

normality with mean and variance unknown. J. Amer.

Stat. Assoc. 62, 399e405.

Martens, H., Kohler, A., 2008. Bio-spectroscopy and bio-

chemometrics: high-throughput metabolic proﬁling for

integrative genetics. In: Proceedings of the Metab-

omeeting 2008 Conference, 28e29th April 2008. Ecole

Normale Supe

´rieure de Lyon, Lyon, France, p. 18.

Nielsen, N.-P.V., Carstensen, J.M., Smedsgaard, J., 1998.

Aligning of single and multiple wavelength chromato-

graphic proﬁles for chemometric data analysis using

correlation optimised warping. J. Chromatogr. A 805,

17e35.

Oliveri, P., Casale, M., Casolino, M.C., Baldo, M.A., Nizzi-

Griﬁ, F., Forina, M. A comparison between classical and

innovative class-modelling techniques for the charac-

terisation of a PDO olive oil, Anal. Bioanal. Chem, in

press. doi:10.1007/s00216-010-4377-1.

½

AU3

Pearson, K., 1901. On lines and planes of closest ﬁt to

systems of points in space. Philos. Mag. 2 (6),

559e572.

Reis, M.S., Saraiva, P.M., Bakshi, B.R., 2009. Denoising and

signal-to-noise ratio enhancement: wavelet transform

and fourier transform. In: Brown, S.D., Tauler, R.,

Walczak, B. (Eds.), Comprehensive Chemometrics,

vol. 2. Elsevier, Amsterdam, pp. 25e55.

Savitzky, A., Golay, M.J.E., 1964. Smoothing and differenti-

ation of data by simpliﬁed least squares procedure.

Anal. Chem. 36, 1627e1639.

Sharoba, A.M., Senge, B., El-Mansy, H.A., Bahlol, H.ElM.,

Blochwitz, R., 2005. Chemical, sensory and rheological

properties of some commercial German and Egyptian

tomato ketchups. Eur. Food Res. Technol. 220, 142e151.

Smirnov, N.V., 1939. On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. Univ. Moscow 2, 3–14.

Snedecor, G.W., Cochran, W.G., 1989. Statistical Methods, eighth ed. Iowa State University Press.

Snee, R., 1977. Validation of regression models: methods and examples. Technometrics 19, 415–428.

Student, 1908. The probable error of a mean. Biometrika 6, 1–25.

Taavitsainen, V.M., 2009. Denoising and signal-to-noise ratio enhancement: derivatives. In: Brown, S.D., Tauler, R., Walczak, B. (Eds.), Comprehensive Chemometrics, vol. 2. Elsevier, Amsterdam, pp. 57–66.

Taguchi, G., 1986. Introduction to Quality Engineering. Designing Quality into Products and Processes. Asian Productivity Organization, ASI Press, Dearborn.

Valcárcel, M., Cárdenas, S., 2005. Vanguard-rearguard analytical strategies. Trends Anal. Chem. 24, 67–74.

Vandeginste, B.G.M., Massart, D.L., Buydens, L.M.C., De Jong, S., Lewi, P.J., Smeyers-Verbeke, J., 1998. Handbook of Chemometrics and Qualimetrics, vol. 20B. Elsevier, Amsterdam.

Wold, S., 1972. Spline functions, a new tool in data-analysis. Kem. Tidskr. 84, 34–37.

Wold, S., Sjöström, M., 1977. SIMCA: a method for analysing chemical data in terms of similarity and analogy. In: Kowalski, B.R. (Ed.), Chemometrics: Theory and Applications, ACS Symposium Series 52. American Chemical Society, Washington, pp. 243–282.

Wold, S., Sjöström, M., Eriksson, L., 2001. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 58, 109–130.

Zupan, J., 1991. Introduction to artificial neural network (ANN) methods: what they are and how to use them. Acta Chim. Slov. 41, 327–352.

Abstract:

Data mining is usually the last, but no less important, step of any food analysis process. Indeed, it is a critical phase: proper data processing allows useful information about the system under study to be extracted from large amounts of collected data, and obtaining information is usually the main objective in analytical chemistry. The classical univariate approach, which considers one variable at a time, underutilizes the global data structure and offers only a partial picture of it. Multivariate strategies, instead, allow a more complete interpretation of the data and fuller exploitation of the information they contain. Multivariate techniques can be used both for exploratory purposes and for qualitative or quantitative modeling. Modeling is generally performed for predictive applications: in such cases, thorough model validation is always required.

Keywords: multivariate data analysis; chemometrics; pattern recognition; modeling; validation