CHAPTER 2

Data Analysis and Chemometrics

Paolo Oliveri, Michele Forina
Department of Drug and Food Chemistry and Technology, University of Genoa, Via Brigata Salerno, Genoa, Italy
OUTLINE

2.1. Introduction
2.1.1. From Data to Information
2.2. From Univariate to Multivariate
2.2.1. Histograms
2.2.2. Normality Tests
2.2.3. ANOVA
2.2.4. Radar Charts
2.3. Multivariate Data Analysis
2.3.1. Principal-Component Analysis
2.3.2. Signal Pre-Processing
2.3.3. Supervised Data Analysis and Validation
2.3.4. Supervised Qualitative Modeling
2.3.5. Supervised Quantitative Modeling
2.3.6. Artificial Neural Networks
2.1. INTRODUCTION

2.1.1. From Data to Information
Advances in technology and the increasing availability of powerful instrumentation now offer analytical food chemists the possibility of obtaining large amounts of data on each sample analyzed, in a reasonable, often negligible, time frame (Valcárcel and Cárdenas, 2005).
Often, in fact, a single analysis may provide a considerable number of measured quantities, generally of the same nature. For instance, gas chromatographic (GC) analysis of fatty acid methyl esters allows us to quantify, with a single chromatogram, the fatty acid composition of a vegetable oil sample (American Oil Chemists' Society, 1998). Spectroscopic techniques as well may supply, with a single and rapid analysis of a sample, multiple data of a homogeneous nature: in fact, a spectrum can be considered as a data vector, in which the order of the variables (e.g., absorbances at consecutive wavelengths) has a physical meaning (Oliveri et al., 2011).
In other cases, a set of samples can be described by a number of heterogeneous chemical and physical parameters at the same time. For example, a global analytical characterization of a tomato sauce may involve the quantification of color and rheological parameters as well as pH and chemical composition and, possibly, a number of sensory responses (Sharoba et al., 2005). Also in such cases, each sample may be described by a data vector, but without any implication with respect to the order of the variables. Instead, differences in magnitude and scale between different variables may affect data analysis if a proper pre-processing approach is not followed.
The availability of large sets of data does not immediately imply the availability of information about the samples analyzed: usually, in fact, a number of steps are required to extract and properly interpret the potential information embodied within the data (Martens and Kohler, 2008).
A deep understanding of the nature of analytical data is the first basic step for any proper data treatment, because different data types usually require different processing strategies, which closely depend on their nature and origin. For this reason, the data analyst should always have a complete awareness of the problem under study and of the whole analytical process from which the data derive, from sampling to instrumental analysis. Such knowledge is fundamental: it makes the difference between a chemometrician and a mathematician. A chemometrician is, first of all, a chemist, who is acquainted with his data and utilizes mathematical methods for the conversion of numerical records into relevant chemical information.
The analytical food chemist William Sealy Gosset (1876–1937), who worked at the Arthur Guinness & Son brewery in Dublin, can be considered one of the fathers of chemometrics. In fact, he studied a number of statistical tools and adapted them to better solve actual chemical problems. He had to present his studies under a pseudonym, since his company did not permit him to publish any data. Considering himself a modest contributor in the field, rather than a statistician, he adopted the pen name Student. His most famous work was on the definition of the probability distribution that is commonly referred to as Student's t distribution (Student, 1908).
The term chemometrics was used for the first time by Svante Wold, in 1972, to identify the discipline that performs the extraction of useful chemical information from complex experimental systems (Wold, 1972).
Statistics offers a number of helpful tools that can be used for converting data into information. Univariate methods, which consider one variable at a time, independently of the others, have been and still are extensively used for such purposes. Nonetheless, they usually supply only partial answers to the problems under study, since they underutilize the potential for discovering global information embodied in the data. For instance, they are not able to take into account the inter-correlation between variables, a feature that can be very informative if recognized and properly interpreted.
Multivariate strategies are able to take such an aspect into account, allowing a more complete interpretation of data structures. However, in spite of their great potential, multivariate methods are generally less used than univariate tools.
On the other hand, a number of people try multivariate analysis as a last-ditch resort, when nothing else seems to provide the desired results, expecting chemometrics to provide valuable information from data that do not contain any informative feature at all.
Such an attitude is very hazardous, especially when complex methods are being used, because there is a risk of exploiting chance correlations to develop models with good performance only in appearance, namely on the same samples used for model building, but with very poor prediction ability on new samples: this is the so-called overfitting. To overcome such a possibility, a proper validation of models is always required. In particular, the more complex the technique applied, the deeper the validation recommended.
For these reasons, a good understanding of the characteristics of the methods employed for data processing is always advantageous as well.
In this chapter, an overview of the chemometric techniques most commonly used for data analysis in analytical food chemistry will be presented, highlighting the potential and the limits of each one.
2.2. FROM UNIVARIATE TO MULTIVARIATE
A bidimensional table is probably the most typical way to arrange, present, and store analytical data: conventionally, in chemometrics, each row usually represents one of the samples analyzed, while each column corresponds to one of the variables measured.
As an example, Table 2.1 reports the data set red wines, which consists of 27 chemical and physical parameters measured on 90 wine samples belonging to three Italian denominations of origin from the same region (Piedmont): Barolo, Grignolino, and Barbera. The original data set was composed of 178 samples (Forina et al., 1986).
Table 2.1 also contains additional information, which is usually not processed but which may be extremely helpful in the final understanding and interpretation of the results. In particular, the heading lines contain the numbers and the names of the variables, which constitute additional information for the columns, while the two heading columns include the names identifying the samples and their class, which constitute additional information for the rows.
It is easy to guess that such data enclose a great deal of potential information. However, simple visual inspection of the table, which contains a considerable number of records, does not directly provide any valuable information about the samples analyzed. A conversion from data into information is necessary.
Univariate methods are still the most used in many cases, although they generally offer only a very limited view of the global situation.
2.2.1. Histograms
A good way to extract information from data is to use graphical tools. Among them, histograms are probably the most widely employed (Chambers et al., 1983).
To build a histogram, the range of interest of the variable under study is divided into a number of regular adjacent intervals. For each interval, the contribution of the measured samples is graphically displayed by a vertical rectangle, whose area is proportional to the frequency (i.e., the number of observations) within that interval. Consequently, the height of each rectangle is equal to the frequency divided by the interval width, so that it has the dimension of a frequency density.
Frequently, such frequency values are normalized, dividing each of them by the total number of observations, thus obtaining relative values. It follows that, in such cases, the sum of the areas of all the rectangles, i.e., the sum of all the relative frequencies, is equal to 1.
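As an illustration, such a normalized histogram can be computed in a few lines of Python (a sketch only: the numpy and matplotlib libraries and the example values are assumptions, not part of the original text):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical phosphate values (g/l); any numeric vector works.
x = np.array([320, 395, 497, 580, 408, 418, 306, 502, 440, 391])

# Divide the range of interest into regular adjacent intervals (bins).
counts, edges = np.histogram(x, bins=6)

# Normalize so that the sum of all rectangle areas equals 1:
# height = relative frequency / bin width, i.e., a frequency density.
width = edges[1] - edges[0]
density = counts / (counts.sum() * width)

plt.bar(edges[:-1], density, width=width, align='edge', edgecolor='k')
plt.xlabel('Phosphate (g/l)')
plt.ylabel('Relative frequency density')
plt.show()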
The frequency distribution visualized by a histogram can be used to estimate the probability distribution of the variable under study and to make deductions about the samples.
TABLE 2.1 Red-Wine Data Set, and Basic Statistical Parameters

Name | Category | Alcohol (% abv) | Sugar-free extract (g/l) | Fixed acidity (g/l) | Tartaric acid (g/l) | Malic acid (g/l) | Uronic acids (mg/l) | pH | Ash (g/l) | Alkalinity of ash (meq/l) | Potassium (mg/l) | Calcium (mg/l) | Magnesium (mg/l) | Phosphate (g/l)
OLO0171 Barolo 14.23 24.82 73.10 1.21 1.71 0.72 3.38 2.43 15.60 950 62 127 320
OLO0271 Barolo 13.20 26.30 72.80 1.84 1.78 0.71 3.30 2.14 11.20 765 75 100 395
OLO0371 Barolo 13.16 26.30 68.50 1.94 2.36 0.84 3.48 2.67 18.60 936 70 101 497
OLO0471 Barolo 14.37 25.85 74.90 1.59 1.95 0.72 3.43 2.50 16.80 985 47 113 580
OLO0571 Barolo 13.24 26.05 83.50 1.30 2.59 1.10 3.42 2.87 21.00 1088 70 118 408
OLO0671 Barolo 14.20 28.40 79.90 2.14 1.76 0.96 3.39 2.45 15.20 868 71 112 418
OLO0771 Barolo 14.39 27.02 64.30 1.64 1.87 0.95 3.42 2.45 14.60 889 67 96 306
OLO0871 Barolo 14.06 26.40 73.50 1.33 2.15 1.14 3.54 2.61 17.60 894 50 121 502
OLO0971 Barolo 14.83 26.80 69.50 1.82 1.64 0.67 3.30 2.17 14.00 765 49 97 440
OLO1071 Barolo 13.86 27.00 68.50 1.92 1.35 0.67 3.27 2.27 16.00 794 51 98 391
OLO1171 Barolo 14.10 26.08 72.50 1.64 2.16 0.62 3.31 2.30 18.00 838 61 105 399
OLO1271 Barolo 14.12 28.35 72.90 1.51 1.48 0.96 3.20 2.32 16.80 827 60 95 424
OLO1371 Barolo 13.75 30.25 75.10 1.92 1.73 0.64 3.18 2.41 16.00 752 65 89 453
OLO1471 Barolo 14.75 30.40 98.90 2.08 1.73 0.72 3.01 2.39 11.40 910 46 91 510
OLO1571 Barolo 14.38 27.10 72.30 1.95 1.87 0.67 3.20 2.38 12.00 927 29 102 523
OLO1671 Barolo 13.63 27.15 69.60 1.48 1.81 0.67 3.47 2.70 17.20 905 28 112 385
OLO1771 Barolo 14.30 27.90 74.90 1.41 1.92 0.82 3.40 2.72 20.00 860 108 120 513
OLO1871 Barolo 13.83 26.30 64.90 1.93 1.57 0.68 3.43 2.62 20.00 905 68 115 419
OLO1971 Barolo 14.19 26.40 72.00 1.85 1.59 0.82 3.38 2.48 16.50 964 86 108 488
OLO0173 Barolo 13.64 27.72 91.50 1.35 3.10 0.82 3.30 2.56 15.20 1038 111 116 402
OLO0273 Barolo 14.06 25.32 71.10 1.34 1.63 1.00 3.47 2.28 16.00 905 79 126 323
OLO0373 Barolo 12.93 28.80 102.10 1.05 3.80 0.89 3.26 2.65 18.60 915 79 102 294
OLO0473 Barolo 13.71 27.63 80.00 2.23 1.86 1.21 3.33 2.36 16.60 815 89 101 476
OLO0573 Barolo 12.85 25.80 69.60 1.54 1.60 0.79 3.45 2.52 17.80 958 101 95 415
OLO0673 Barolo 13.50 25.00 81.60 1.55 1.81 0.95 3.42 2.61 20.00 992 62 96 476
OLO0773 Barolo 13.05 25.72 78.30 1.15 2.05 1.08 3.57 3.22 25.00 1095 63 124 536
OLO0873 Barolo 13.39 27.10 72.30 1.52 1.77 1.05 3.46 2.62 16.10 936 68 93 395
OLO0973 Barolo 13.30 22.70 68.30 1.74 1.72 1.06 3.44 2.14 17.00 882 52 94 434
OLO1073 Barolo 13.87 29.30 68.30 1.38 1.90 0.75 3.42 2.80 19.40 1085 68 107 396
OLO1173 Barolo 14.02 25.20 69.60 1.71 1.68 0.79 3.26 2.21 16.00 780 62 96 510
GRI0170 Grignolino 12.37 18.30 90.10 2.80 0.94 0.73 3.11 1.36 10.60 580 77 88 296
GRI0270 Grignolino 12.33 22.90 72.20 2.25 1.10 0.69 3.26 2.28 16.00 715 85 101 365
GRI0370 Grignolino 12.64 23.90 95.70 1.93 1.36 1.06 3.19 2.02 16.80 688 83 100 395
GRI0470 Grignolino 13.67 22.20 64.80 2.20 1.25 0.74 3.40 1.92 18.00 725 51 94 301
GRI0570 Grignolino 12.37 23.50 70.00 2.06 1.13 0.72 3.30 2.16 19.00 785 73 87 422
GRI0670 Grignolino 12.17 23.03 65.70 1.84 1.45 0.72 3.35 2.53 19.00 790 62 104 411
GRI0770 Grignolino 12.37 26.80 62.70 1.70 1.21 0.88 3.40 2.56 18.10 978 55 98 310
GRI0870 Grignolino 13.11 23.70 80.00 1.40 1.01 0.77 3.10 1.70 15.00 730 80 78 297
GRI0970 Grignolino 12.37 20.90 63.70 1.94 1.17 0.67 3.40 1.92 19.60 785 40 78 212
GRI0171 Grignolino 13.34 23.72 70.00 2.02 0.94 1.09 3.26 2.36 17.00 760 64 110 451
GRI0271 Grignolino 12.21 22.70 90.70 3.62 1.19 0.94 3.14 1.75 16.80 795 134 151 448
GRI0371 Grignolino 12.29 21.40 55.60 1.43 1.61 0.87 3.54 2.21 20.40 682 102 103 324
GRI0471 Grignolino 13.86 25.25 59.50 1.27 1.51 1.09 3.63 2.67 25.00 785 63 86 383
GRI0571 Grignolino 13.49 22.30 60.90 1.74 1.66 0.67 3.44 2.24 24.00 680 60 87 300
GRI0671 Grignolino 12.99 26.10 50.50 1.42 1.67 1.24 3.52 2.60 30.00 974 55 139 473
GRI0771 Grignolino 11.96 24.50 65.70 2.18 1.09 0.73 3.40 2.30 21.00 681 98 101 366
GRI0871 Grignolino 11.66 20.30 61.70 1.70 1.88 0.60 3.30 1.92 16.00 785 52 97 312
GRI0971 Grignolino 13.03 23.50 78.60 1.90 0.90 0.76 3.30 1.71 16.00 790 57 86 396
GRI0172 Grignolino 11.84 26.40 108.70 1.70 2.89 0.91 3.11 2.23 18.00 790 71 112 350
GRI0272 Grignolino 12.33 20.60 58.70 2.41 0.99 0.84 3.32 1.95 14.80 680 124 136 438
GRI0372 Grignolino 12.70 27.15 93.30 1.46 3.87 1.11 3.19 2.40 23.00 890 110 101 321
GRI0472 Grignolino 12.00 23.20 58.40 1.88 0.92 0.82 3.30 2.00 19.00 680 63 86 408
GRI0572 Grignolino 12.72 22.90 58.40 1.40 1.81 0.81 3.50 2.20 18.80 890 83 86 418
GRI0672 Grignolino 12.08 23.50 56.90 1.33 1.13 0.71 3.65 2.51 24.00 980 85 78 215
GRI0772 Grignolino 13.05 25.50 104.80 1.64 3.86 0.73 3.19 2.32 22.50 938 98 85 195
GRI0173 Grignolino 11.84 23.40 70.80 1.80 0.89 1.00 3.40 2.58 18.00 922 80 94 378
GRI0273 Grignolino 12.67 24.30 74.10 1.70 0.98 0.88 3.35 2.24 18.00 840 81 99 336
GRI0373 Grignolino 12.16 25.80 78.90 1.84 1.61 0.78 3.37 2.31 22.80 845 98 90 285
GRI0473 Grignolino 11.65 22.90 62.90 1.80 1.67 0.64 3.55 2.62 26.00 1045 125 88 281
GRI0573 Grignolino 11.64 24.20 72.40 1.84 2.06 0.89 3.40 2.46 21.60 962 79 84 304
ERA0174 Barbera 12.86 26.80 87.30 0.99 1.35 0.92 3.22 2.32 18.00 830 52 122 266
ERA0274 Barbera 12.88 23.95 78.90 1.85 2.99 0.98 3.50 2.40 20.00 795 55 104 269
ERA0374 Barbera 12.81 24.45 76.20 2.93 2.31 0.87 3.64 2.40 24.00 785 49 98 266
ERA0474 Barbera 12.70 24.75 91.00 1.91 3.55 1.80 3.26 2.36 21.50 805 47 106 356
ERA0574 Barbera 12.51 23.50 104.70 1.34 1.24 0.98 3.50 2.25 17.50 975 60 85 273
ERA0674 Barbera 12.60 23.60 80.60 2.26 2.46 0.97 3.31 2.20 18.50 760 103 94 275
ERA0774 Barbera 12.25 25.30 91.40 1.42 4.72 1.25 3.40 2.54 21.00 995 105 89 262
ERA0874 Barbera 12.53 27.10 99.80 1.88 5.51 1.19 3.30 2.64 25.00 930 100 96 360
ERA0974 Barbera 13.49 25.70 115.50 2.17 3.59 1.47 3.24 2.19 19.50 825 111 88 315
ERA0176 Barbera 12.84 26.20 82.00 1.79 2.96 1.26 3.50 2.61 24.00 925 48 101 398
ERA0276 Barbera 12.93 26.78 80.00 1.69 2.81 1.15 3.31 2.70 21.00 965 40 96 351
ERA0376 Barbera 13.36 24.12 97.80 2.83 2.56 0.77 3.35 2.35 20.00 880 47 89 235
ERA0476 Barbera 13.52 27.90 85.00 1.46 3.17 1.23 3.28 2.72 23.50 880 38 97 325
ERA0576 Barbera 13.62 25.52 93.70 2.70 4.95 1.56 3.41 2.35 20.00 805 57 92 191
ERA0179 Barbera 12.25 23.40 113.50 3.54 3.88 1.04 3.01 2.20 18.50 785 77 112 358
ERA0279 Barbera 13.16 22.90 117.90 3.15 3.57 1.18 3.14 2.15 21.00 805 88 102 456
ERA0379 Barbera 13.88 21.40 99.30 2.81 5.04 1.29 3.28 2.23 20.00 750 43 80 171
ERA0479 Barbera 12.87 24.35 98.90 2.51 4.61 1.25 3.18 2.48 21.50 830 63 86 366
ERA0579 Barbera 13.32 21.46 96.90 2.85 3.24 1.75 3.30 2.38 21.50 790 42 92 306
ERA0178 Barbera 13.08 26.80 120.60 2.90 3.90 1.11 3.16 2.36 21.50 790 73 113 303
ERA0278 Barbera 13.50 26.50 105.50 2.31 3.12 1.31 3.23 2.62 24.00 980 67 123 338
ERA0378 Barbera 12.79 23.40 117.80 3.12 2.67 0.82 3.21 2.48 22.00 890 53 112 407
ERA0478 Barbera 13.11 25.20 95.40 2.26 1.90 0.86 3.49 2.75 25.50 1140 74 116 289
ERA0578 Barbera 13.23 23.85 120.60 2.80 3.30 0.80 3.20 2.28 18.50 915 68 98 351
ERA0678 Barbera 12.58 21.75 102.70 2.92 1.29 0.79 3.21 2.10 20.00 875 107 103 368
ERA0778 Barbera 13.17 23.20 129.30 2.28 5.19 1.49 3.58 2.32 22.00 1045 102 93 241
ERA0878 Barbera 13.84 24.70 122.90 2.76 4.12 1.07 3.19 2.38 19.50 840 108 89 402
ERA0978 Barbera 12.45 25.35 105.90 2.23 3.03 1.24 3.62 2.64 27.00 1050 118 97 393
ERA1078 Barbera 14.34 29.10 97.50 2.73 1.68 1.60 3.42 2.70 25.00 1095 78 98 462
ERA1178 Barbera 13.48 26.95 102.50 3.75 1.67 1.37 3.41 2.64 22.50 1055 79 89 480
mean 13.13 25.07 82.46 1.97 2.22 0.95 3.35 2.37 19.27 868.7 72.6 100.6 369.5
variance 0.59 5.25 332.34 0.35 1.27 0.07 0.02 0.08 13.24 13102.4 536.4 194.7 7453.5
standard deviation 0.77 2.29 18.23 0.59 1.12 0.26 0.14 0.28 3.64 114.5 23.2 14.0 86.3
Variables 14–27 (same samples, in the same row order):

Chloride (mg/l) | Total phenols (g/l) | Flavanoids | Non-flavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | OD280/OD315 of flavanoids | Glycerol (g/l) | 2,3-Butanediol (g/l) | Total nitrogen (mg/l) | Proline (mg/l) | Methanol (% A.A.)
82 2.80 3.06 0.28 2.29 5.64 1.04 3.92 4.77 9.29 757 153 1065 113
90 2.65 2.76 0.26 1.28 4.38 1.05 3.40 3.80 8.93 881 194 1050 94
67 2.80 3.24 0.30 2.81 5.68 1.03 3.17 3.46 11.74 900 206 1185 125
49 3.85 3.49 0.24 2.18 7.80 0.86 3.45 3.54 10.13 1119 292 1480 80
65 2.80 2.69 0.39 1.82 4.32 1.04 2.93 3.22 10.27 799 215 735 73
58 3.27 3.39 0.34 1.97 6.75 1.05 2.85 3.16 10.85 865 364 1450 68
52 2.50 2.52 0.30 1.98 5.25 1.02 3.58 3.94 9.05 931 378 1290 80
64 2.60 2.51 0.31 1.25 5.05 1.06 3.58 3.94 10.13 865 358 1295 100
58 2.80 2.98 0.29 1.98 5.20 1.08 2.85 3.03 9.89 825 438 1045 141
64 2.98 3.15 0.22 1.85 7.22 1.01 3.55 3.75 12.65 788 350 1045 121
61 2.95 3.32 0.22 2.38 5.75 1.25 3.17 3.27 8.59 964 378 1510 123
79 2.20 2.43 0.26 1.57 5.00 1.17 2.82 3.04 11.52 894 294 1280 134
257 2.60 2.76 0.29 1.81 5.60 1.15 2.90 2.92 12.24 784 289 1320 164
50 3.10 3.69 0.43 2.81 5.40 1.25 2.73 2.82 12.29 766 224 1150 105
55 3.30 3.64 0.29 2.96 7.50 1.20 3.00 3.32 9.53 1041 324 1547 114
50 2.85 2.91 0.30 1.46 7.30 1.28 2.88 3.12 7.92 812 229 1310 97
62 2.80 3.14 0.33 1.97 6.20 1.07 2.65 3.10 9.24 836 308 1280 113
58 2.95 3.40 0.40 1.72 6.60 1.13 2.57 2.66 9.41 722 274 1130 99
28 3.30 3.93 0.32 1.86 8.70 1.23 2.82 3.17 9.85 808 230 1680 135
67 2.70 3.03 0.17 1.66 5.10 0.96 3.36 4.00 10.39 726 227 845 119
73 3.00 3.17 0.24 2.10 5.65 1.09 3.71 3.75 10.30 828 225 780 145
62 2.41 2.41 0.25 1.98 4.50 1.03 3.52 3.66 11.88 589 237 770 123
134 2.61 2.88 0.27 1.69 3.80 1.11 4.00 4.31 8.81 715 270 1035 109
73 2.48 2.37 0.26 1.46 3.93 1.09 3.63 3.82 9.14 568 248 1015 102
47 2.53 2.61 0.28 1.66 3.52 1.12 3.82 4.00 9.45 667 210 845 86
82 2.63 2.68 0.47 1.92 3.58 1.13 3.20 3.63 10.34 753 238 830 124
52 2.85 2.94 0.34 1.45 4.80 0.92 3.22 4.44 9.96 854 285 1195 85
46 2.40 2.19 0.27 1.35 3.95 1.02 2.77 3.10 9.70 757 350 1285 84
76 2.95 2.97 0.37 1.76 4.50 1.25 3.40 3.72 9.53 702 280 915 99
53 2.65 2.33 0.26 1.98 4.70 1.04 3.59 3.77 9.94 689 293 1035 100
52 1.98 0.57 0.28 0.42 1.95 1.05 1.82 2.12 5.40 736 287 520 98
108 2.05 1.09 0.63 0.41 3.27 1.25 1.67 1.42 6.90 658 345 680 127
53 2.02 1.41 0.53 0.62 5.75 0.98 1.59 1.86 8.20 691 321 450 60
47 2.10 1.79 0.32 0.73 3.80 1.23 2.46 1.73 8.60 797 262 630 87
306 3.50 3.10 0.19 1.87 4.45 1.22 2.87 3.07 7.20 748 141 420 157
116 1.89 1.75 0.45 1.03 2.95 1.45 2.23 2.73 7.50 627 219 355 58
69 2.42 2.65 0.37 2.08 4.60 1.19 2.30 2.60 7.96 680 259 678 118
148 2.98 3.18 0.26 2.28 5.30 1.12 3.18 3.33 8.20 604 100 502 114
54 2.11 2.00 0.27 1.04 4.68 1.12 3.48 4.07 7.10 554 425 510 98
111 2.53 1.30 0.55 0.42 3.17 1.02 1.93 1.92 8.10 704 363 750 137
88 1.85 1.28 0.14 2.50 2.85 1.28 3.07 3.23 6.81 714 195 718 116
50 1.10 1.02 0.37 1.46 3.05 0.91 1.82 2.00 6.38 661 301 870 78
59 2.95 2.86 0.21 1.87 3.38 1.36 3.16 3.52 7.62 748 170 410 99
43 1.88 1.84 0.27 1.03 3.74 0.98 2.78 3.50 8.04 614 160 472 64
35 3.30 2.89 0.21 1.96 3.35 1.31 3.50 3.60 8.00 731 293 985 113
48 3.38 2.14 0.13 1.65 3.21 0.99 3.13 3.15 7.70 563 183 886 57
59 1.61 1.57 0.34 1.15 3.80 1.23 2.14 2.35 6.14 596 109 428 129
122 1.95 2.03 0.24 1.46 4.60 1.19 2.48 2.85 8.40 756 167 392 145
58 1.72 1.32 0.43 0.95 2.65 0.96 2.52 3.25 5.22 514 246 500 122
99 1.90 1.85 0.35 2.76 3.40 1.06 2.31 2.70 7.96 654 259 750 59
52 2.83 2.55 0.43 1.95 2.57 1.19 3.13 3.82 8.66 700 227 463 101
27 2.42 2.26 0.30 1.43 2.50 1.38 3.12 3.52 5.80 645 199 278 95
64 2.20 2.53 0.26 1.77 3.90 1.16 3.14 3.33 7.38 664 199 714 111
53 2.00 1.58 0.40 1.40 2.20 1.31 2.72 3.50 8.11 548 203 630 115
48 1.65 1.59 0.61 1.62 4.80 0.84 2.01 2.07 8.64 649 207 515 114
95 2.20 2.21 0.22 2.35 3.05 0.79 3.08 3.81 6.36 586 138 520 141
70 2.20 1.94 0.30 1.46 2.62 1.23 3.16 3.60 7.90 600 217 450 121
54 1.78 1.69 0.43 1.56 2.45 1.33 2.26 2.92 8.04 643 195 495 116
36 1.92 1.61 0.40 1.34 2.60 1.36 3.21 3.27 9.54 608 262 562 120
70 1.95 1.69 0.48 1.35 2.80 1.00 2.75 3.60 7.97 523 223 680 120
46 1.51 1.25 0.21 0.94 4.10 0.76 1.29 1.26 6.43 673 252 630 122
72 1.30 1.22 0.24 0.83 5.40 0.74 1.42 1.34 10.10 918 319 530 102
67 1.15 1.09 0.27 0.83 5.70 0.66 1.36 1.24 10.02 1095 258 560 132
118 1.70 1.20 0.17 0.84 5.00 0.78 1.29 1.23 8.52 1020 238 600 121
29 2.00 0.58 0.60 1.25 5.45 0.75 1.51 1.40 8.32 764 178 650 79
77 1.62 0.66 0.63 0.94 7.10 0.73 1.58 1.37 6.47 573 174 695 100
144 1.38 0.47 0.53 0.80 3.85 0.75 1.27 1.12 8.25 680 217 720 107
6 1.79 0.60 0.63 1.10 5.00 0.82 1.69 1.80 8.35 821 230 515 139
56 1.62 0.48 0.58 0.88 5.70 0.81 1.82 2.23 10.40 700 245 580 150
15 2.32 0.60 0.53 0.81 4.92 0.89 2.15 2.25 10.60 940 269 590 132
25 1.54 0.50 0.53 0.75 4.60 0.77 2.31 2.34 10.62 955 260 600 82
71 1.40 0.50 0.37 0.64 5.60 0.70 2.47 2.60 10.41 814 216 780 106
21 1.55 0.52 0.50 0.55 4.35 0.89 2.06 2.21 10.20 976 201 520 118
16 2.00 0.80 0.47 1.02 4.40 0.91 2.05 2.55 8.90 899 205 550 140
14 1.38 0.78 0.29 1.14 8.21 0.65 2.00 2.23 8.16 521 218 855 97
17 1.50 0.55 0.43 1.30 4.00 0.60 1.68 2.24 5.61 696 252 830 63
10 0.98 0.34 0.40 0.68 4.90 0.58 1.33 1.81 7.94 670 156 415 154
50 1.70 0.65 0.47 0.86 7.65 0.54 1.86 2.10 8.52 806 213 625 122
21 1.93 0.76 0.45 1.25 8.42 0.55 1.62 2.19 6.12 604 219 650 106
50 1.41 1.39 0.34 1.14 9.40 0.57 1.33 1.26 7.36 733 164 550 114
106 1.40 1.57 0.22 1.25 8.60 0.59 1.30 1.29 6.28 568 129 500 107
127 1.48 1.36 0.24 1.26 10.80 0.48 1.47 1.40 7.00 898 154 480 91
55 2.20 1.28 0.26 1.56 7.10 0.61 1.33 1.25 8.57 905 249 425 125
35 1.80 0.83 0.61 1.87 10.52 0.56 1.51 1.42 10.80 915 154 675 84
100 1.48 0.58 0.53 1.40 7.60 0.58 1.55 1.34 7.52 924 142 640 100
84 1.74 0.63 0.61 1.55 7.90 0.60 1.48 1.31 9.50 969 207 725 84
6 1.80 0.83 0.48 1.56 9.01 0.57 1.64 1.92 9.29 902 159 480 132
53 1.90 0.58 0.63 1.14 7.50 0.67 1.73 2.18 10.20 865 252 880 118
49 2.80 1.31 0.53 2.70 13.00 0.57 1.96 2.25 10.82 764 223 660 182
35 2.60 1.10 0.52 2.29 11.75 0.57 1.78 2.09 11.09 1080 250 620 160
mean 66.5 2.24 1.90 0.36 1.51 5.27 0.97 2.51 2.75 8.79 759.7 240.4 779.3 110.2
variance 1977.3 0.40 0.99 0.02 0.35 4.90 0.06 0.62 0.87 2.76 20104.3 4691.8 102493.8 649.9
standard deviation 44.5 0.63 1.00 0.13 0.59 2.21 0.25 0.78 0.93 1.66 141.8 68.5 320.1 25.5
Figure 2.1 shows examples of histograms for a portion of the data given in Table 2.1, namely for variables number 13, 21, and 26.
Three typical patterns are noticeable. In particular, variable 13 (phosphate) shows a unimodal and almost symmetric shape, which may suggest that this variable follows a normal probability distribution (Fig. 2.1a).
Conversely, variable 21 (OD280/OD315 of diluted wines) presents a bimodal distribution, which may suggest that this variable is characterized by different average values in the different sample classes (Fig. 2.1b). In such cases, histograms could be drawn for each class separately, to verify the trend of the within-class distributions.
Instead, the histogram shape for variable 26 (proline) reveals an underlying asymmetric distribution (Fig. 2.1c). It is possible to convert such behavior into an almost normal one simply by applying a logarithmic transformation to the variable, as shown in Fig. 2.2.

FIGURE 2.1 Histograms for three variables of Table 2.1: phosphate (a), OD280/OD315 of diluted wines (b), and proline (c).
2.2.2. Normality Tests
Assessing compatibility with a normal distribution is a basic issue in data analysis, because many methods require variables to be normally distributed. As observed, frequency distributions may be employed for this purpose. Visual examination of histogram shapes may supply a preliminary evaluation. Besides, the cumulative empirical frequency distributions (EFDs) constitute the basis for a family of statistical normality tests, which are usually referred to as Kolmogorov–Smirnov tests (Kolmogorov, 1933; Smirnov, 1939).
One of the most effective and widely employed among them is the Lilliefors test, which may be used for assessing, in general, how well an empirical distribution fits a theoretical one (Lilliefors, 1970). In the case of normality verification, the null hypothesis (H0) is that the observed empirical frequency distribution for a given variable is not significantly different from the theoretical normal probability distribution, at a given significance level. The alternative hypothesis (H1) is that the observed EFD is not compatible with the theoretical normal distribution, at that significance level.
The test procedure consists in ordering the values of the variable to be tested and normalizing them by means of a Student's transformation (or autoscaling):

$$x'_{i,v} = \frac{x_{i,v} - \bar{x}_v}{s_v} \qquad (2.1)$$

The variable is corrected by subtracting its mean ($\bar{x}_v$) from each of its values and then dividing by its standard deviation ($s_v$). The autoscaled variable is dimensionless and presents a mean equal to 0 and a standard deviation equal to 1.
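As a minimal sketch (assuming numpy; the function name is illustrative), this column autoscaling can be written as:

import numpy as np

def autoscale(X):
    """Column autoscaling (Eqn 2.1): subtract each column mean and
    divide by its standard deviation, so that every variable ends up
    with mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)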
Then, the corresponding cumulative theoretical probability distribution is estimated from the statistical parameters computed, and the maximum distance between such a hypothesized distribution and the empirical one is calculated. This value is compared with a critical distance value, at a predetermined significance level, and this comparison determines the acceptance or rejection of the null hypothesis. The critical values, which depend on the sample size, were obtained by Monte Carlo simulations and are available in tables or statistical software.
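For instance, an implementation of the test is available in the Python statsmodels package; the following sketch (with simulated data, and under the assumption that statsmodels is an acceptable stand-in for the software mentioned above) illustrates the accept/reject logic:

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(0)
x = rng.normal(loc=369.5, scale=86.3, size=90)  # simulated phosphate-like data

# Maximum distance between the empirical and the fitted normal
# cumulative distributions, with an approximate p-value.
d_max, p_value = lilliefors(x, dist='norm')
if p_value < 0.05:
    print('H0 rejected: not compatible with normality (5% level)')
else:
    print('H0 accepted: compatible with a normal distribution (5% level)')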
The Lilliefors test can also be performed in a graphical way (Iman, 1982), as illustrated in Figs 2.3 and 2.4 for the same cases as Figs 2.1 and 2.2.
FIGURE 2.2 Histogram for the log-transformed variable proline of Table 2.1.

Charts for the Lilliefors test report the cumulative empirical frequency distributions (EFDs) for variables number 13, 21, and 26 of Table 2.1, after column autoscaling (red lines in Figs 2.3 and 2.4), together with the cumulative theoretical probability distribution (blue solid lines) and the distance limits according to the Lilliefors test, at a 5% significance level (blue dotted lines). When the EFD curve intersects at least one of the limits individuated by the critical distance, the null hypothesis is rejected. As for the examples reported in Fig. 2.3, the null hypothesis is accepted only for the variable phosphate, while for both the other variables examined it is rejected at the same significance level. In fact, only in the first case (Fig. 2.3a) does the red EFD line not intersect the critical distance lines at any point.
In addition, it can be easily verified, confirming the deductions made by looking at the histogram of Fig. 2.2, that the logarithmic transformation applied to the variable proline makes it compatible with the normal distribution (see Fig. 2.4).
FIGURE 2.3 Graphical Lilliefors normality test for three variables of Table 2.1: phosphate (a), OD280/OD315 of diluted wines (b), and proline (c).

2.2.3. ANOVA

Analysis of variance (ANOVA) is the name of a group of statistical methods based on Fisher's F tests, generally aimed at verifying the existence or absence of significant differences between groups of data. The null hypothesis H0 is that all the data derive from the same stochastic population, i.e., there is no significant difference between the groups considered. In order to verify this hypothesis, the final F test compares the variability between groups with the variability within groups (Box et al., 1978).
The simplest case is the one-way ANOVA, whose procedure is described here with a real numerical example. The two columns of Table 2.2 report the values of the alcoholic degree for the Barolo and Barbera wine samples of the red-wine data set (Table 2.1), respectively, together with some basic descriptive parameters. A summary of all the parameters computed for the ANOVA test is given in Table 2.3. The aim is to assess whether there is a significant difference between the alcohol contents of the two wines or not. In fact, although the mean Barbera alcoholic degree (13.07% abv) is noticeably lower than the corresponding Barolo value (13.83% abv), the two respective ranges overlap, so that it might be suspected that the observed difference is due to chance variations.
The within-columns variance can be computed as a pooled variance, under the hypothesis that the variances of the different groups are homogeneous. When only two groups (and, consequently, two variances) are being compared, a Fisher's F test is suitable to verify this preliminary hypothesis. In the given numerical example, the test value is computed as

$$F_t = \frac{s^2_{\mathrm{Barolo}}}{s^2_{\mathrm{Barbera}}} = \frac{0.274}{0.254} = 1.08 \qquad (2.2)$$
The F critical value, at a 5% right significance level and for 29 degrees of freedom (d.o.f.) both at the numerator and at the denominator, is 1.86. So it is possible to conclude that the variances of the two groups considered are not significantly different, at a 5% right significance level.
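The same comparison can be sketched in Python (scipy is an assumed dependency; the variances are those of Table 2.2):

from scipy import stats

s2_barolo, s2_barbera = 0.274, 0.254   # group variances from Table 2.2
n = 30                                 # samples per group

F = s2_barolo / s2_barbera             # larger variance at the numerator
# Critical value at a 5% right significance level, 29 and 29 d.o.f.
F_crit = stats.f.ppf(0.95, dfn=n - 1, dfd=n - 1)
print(F, F_crit)   # about 1.08 < 1.86: variances not significantly different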
In problems involving more than two groups, the comparison among variances can be performed with multiple F tests on all the possible pairs, or by means of Cochran's test or Bartlett's test (Snedecor and Cochran, 1989). The former is valid when there is an equal number of data in each group, while the latter has a wider applicability.
FIGURE 2.4 Graphical Lilliefors normality test for the log-transformed variable proline of Table 2.1.
In the numerical example discussed, the within-columns variance, computed as a pooled variance, corresponds to

$$s^2_{\mathrm{within}} = \frac{\sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{nc} - \bar{x}_c)^2}{N - C} = \frac{15.314}{58} = 0.264 \qquad (2.3)$$

where C is the number of groups (here, 2), N_c is the number of data in group c, and N is the total number of data (here, 60).
Conversely, the between-columns variance is computed as

$$s^2_{\mathrm{between}} = \frac{\sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})^2}{C - 1} = \frac{8.786}{1} = 8.786 \qquad (2.4)$$

where $\bar{x}$ is the overall mean.
Finally, the ANOVA F test value is computed as the ratio of the between-columns variance to the within-columns variance, to perform the final test that compares these variabilities, for the verification of the ANOVA null hypothesis:

$$F_{\mathrm{ANOVA}} = \frac{s^2_{\mathrm{between}}}{s^2_{\mathrm{within}}} = \frac{8.786}{0.264} = 33.28 \qquad (2.5)$$
The F critical value, at a 5% right significance level, for 1 degree of freedom at the numerator and 58 degrees of freedom at the denominator, is 4.01. From the comparison with the computed test value, it follows that the null hypothesis is rejected at the 5% significance level. The conclusion is that the difference between the alcohol content of Barolo and Barbera samples is significantly larger than the variability within each of the two groups.
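The whole one-way ANOVA can be reproduced with scipy (a sketch; the two arrays are the columns of Table 2.2):

import numpy as np
from scipy import stats

barolo = np.array([14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 14.06,
                   14.83, 13.86, 14.10, 14.12, 13.75, 14.75, 14.38, 13.63,
                   14.30, 13.83, 14.19, 13.64, 14.06, 12.93, 13.71, 12.85,
                   13.50, 13.05, 13.39, 13.30, 13.87, 14.02])
barbera = np.array([12.86, 12.88, 12.81, 12.70, 12.51, 12.60, 12.25, 12.53,
                    13.49, 12.84, 12.93, 13.36, 13.52, 13.62, 12.25, 13.16,
                    13.88, 12.87, 13.32, 13.08, 13.50, 12.79, 13.11, 13.23,
                    12.58, 13.17, 13.84, 12.45, 14.34, 13.48])

# F = between-columns variance / within-columns variance (Eqn 2.5)
F, p = stats.f_oneway(barolo, barbera)
print(F, p)   # F close to 33.28; p far below 0.05, so H0 is rejected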
TABLE 2.2 Alcohol Content (% abv) for Barolo and Barbera Samples of the Red-Wine Data Set, and Basic Statistical Parameters

Barolo   Barbera
14.23 12.86
13.20 12.88
13.16 12.81
14.37 12.70
13.24 12.51
14.20 12.60
14.39 12.25
14.06 12.53
14.83 13.49
13.86 12.84
14.10 12.93
14.12 13.36
13.75 13.52
14.75 13.62
14.38 12.25
13.63 13.16
14.30 13.88
13.83 12.87
14.19 13.32
13.64 13.08
14.06 13.50
12.93 12.79
13.71 13.11
12.85 13.23
13.50 12.58
13.05 13.17
13.39 13.84
13.30 12.45
13.87 14.34
14.02 13.48
Min 12.85 12.25
Max 14.83 14.34
Mean 13.83 13.07
Variance 0.274 0.254
Standard deviation 0.524 0.504
TABLE 2.3 Full ANOVA Parameters for the Data Given in Table 2.2. Computed F ratio (from variances of columns Barolo and Barbera) = 1.08; critical F value (at 5% significance) = 1.86; F test on variances of columns Barolo and Barbera: significance = 41.8%

Source of variation   d.o.f.   Sum of squares   Variance
Total                 60       10874.485
Mean                  1        10850.384
Between columns       1        8.786            8.786
Within columns        58       15.314           0.264

Computed F ratio = 33.28; critical F value (at 5% significance) = 4.01; ANOVA F test: significance = 0.0%.
ANOVA tests can also be applied when the effect of two variability sources (e.g., type of wine and vintage year) is to be verified: such a scheme is usually called a two-way ANOVA. When a number of replicate measurements are available for each level combination of the two factors (nested two-way ANOVA), the model obtained also allows an estimation of the interaction between the factors, together with its significance.
2.2.4. Radar Charts
Radar charts, also known as web charts, spider charts, star charts, cobweb charts, polar charts, star plots, or Kiviat diagrams, are a data display tool that can be considered as a sort of link between univariate and multivariate graphical representations (Chambers et al., 1983).
They consist of circular graphs divided into a number of equiangular spokes, called radii. Each radius represents one of the variables. A point is located on it, whose distance from the center is proportional to the magnitude of the related variable for that datum. Finally, all the data points, corresponding to all the variables measured on a sample, are connected with a line, which represents a sort of sample profile.
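Such a chart can be sketched with matplotlib's polar axes (an illustrative snippet; the variable subset and the profile values are hypothetical):

import numpy as np
import matplotlib.pyplot as plt

labels = ['alcohol', 'pH', 'ash', 'potassium', 'calcium']  # hypothetical subset
values = np.array([0.8, 0.4, 0.6, 0.7, 0.3])               # one sample profile

# One equiangular spoke (radius) per variable; repeat the first point
# so that the profile line closes into a polygon.
theta = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
theta = np.concatenate([theta, theta[:1]])
values = np.concatenate([values, values[:1]])

ax = plt.subplot(projection='polar')
ax.plot(theta, values)          # the sample "profile" line
ax.set_xticks(theta[:-1])
ax.set_xticklabels(labels)
plt.show()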
Usually, each plot represents a single sample, and multiple observations are compared by examining different plots. It is also possible to overdraw several lines on the same chart, although the outcome will be legible only for small data sets. As a matter of fact, when the number of samples is large, such a graphical representation is generally not very functional.
Within radar charts, variables can be represented without any previous scaling, revealing which variables are dominant for a given data set. Nonetheless, when variables are characterized by considerably different scales (as in the case of the red-wine data of Table 2.1), a preliminary transformation may be helpful in order to make the contribution of all of them visible within the graph, by assuring the same a priori importance.
For instance, by looking at Fig. 2.5, it clearly appears that, without any scaling, four features dominate, corresponding to the variables number 10, 13, 24, and 26, which are characterized by the highest mean values (see Table 2.1). The contribution of the remaining 23 variables is not recognizable within these graphs. Furthermore, it is not possible to draw many valuable considerations about the sample profiles. It can only be noticed that Grignolino wines (Fig. 2.5b) are characterized, on average, by smaller values for the four observable variables, that Barolo has a higher contribution from variable 26, and that Barbera (Fig. 2.5c) has higher contributions from variables 24 and 10.
On the other hand, Fig. 2.6 illustrates that, after application of column autoscaling (see Eqn (2.1)), the a priori differences in location and dispersion among the original variables are eliminated, thus showing the contribution of all of them and highlighting the differences among the observations. In fact, in this second graph, the profiles of the three wines appear much more dissimilar than in the previous one. By a joint examination of the three radar charts of Fig. 2.6, it can be deduced that Barolo and Barbera samples present two rather complementary profiles, while the Grignolino profile is somewhat intermediate. In particular, Barolo (Fig. 2.6a) is characterized by higher average values of variables 1, 2, 13, 15, 16, 18, 21, 22, 23, and 26. Instead, Grignolino (Fig. 2.6b) presents lower average values of all the variables, except for number 20. Finally,
Barbera (Fig. 2.6c) has a larger contribution from variables 3, 4, 5, 9, 17, 19, and 24.
Such deductions may be useful for characterization purposes.
2.3. MULTIVARIATE DATA ANALYSIS
2.3.1. Principal-Component Analysis
Principal-component analysis (PCA), which originates in the work of K. Pearson (1901), is one of the basic and most useful tools of multivariate analysis. It is an exploratory method, which always offers an overview of the problem studied and often allows significant conclusions to be drawn and decisions to be made on the basis of the observed results. Furthermore, PCA can be employed for feature and noise reduction purposes and constitutes the basis for other, more complex pattern-recognition techniques.
PCA is based on the assumption that a high variability (i.e., a high variance value) is synonymous with a high amount of information.
FIGURE 2.5 Radar charts of average profiles of Barolo (a), Grignolino (b), and Barbera (c) samples. Numbers from 1 to 27 correspond to the original variables listed in Table 2.1.

FIGURE 2.6 Radar charts of average profiles of Barolo (a), Grignolino (b), and Barbera (c) samples. Numbers from 1 to 27 correspond to the variables listed in Table 2.1, reprocessed by application of column autoscaling.
For this reason, PCA algorithms search for the maximum variance direction in the multidimensional space of the original data, preferably passing through the data centroid, which means that the data have to be at least mean-centered column-wise. The maximum variance direction represents the first principal component (PC). The second PC is the direction which retains the maximum variance among all directions orthogonal to the first PC. It follows that the second PC explains the maximum information not explained by the first one or, in other words, that these two new variables are not inter-correlated. The process continues with the identification of the subsequent PCs: it may stop on reaching a variance cutoff value or continue until all the variability enclosed in the original data has been explained (Jolliffe, 2002).
Since the variance values depend on the scale of the variables, it becomes difficult to compare and impossible to combine information from variables of different nature, unless properly normalized: column autoscaling (see Eqn (2.1)) is the most commonly applied transform.
Each sample can be projected in the space defined by the new variables: the coordinate values obtained are called scores.
The PCs are expressible as linear combinations of the original variables: the coefficients which multiply each variable are called loadings. They represent the cosine values (direction cosines) of the angles between the PCs and the original variables. These values may vary between −1 and +1, indicating the importance of each variable in defining a given PC: the larger the absolute value of the cosine, the closer the two directions, and thus the larger the contribution of the original variable to the PC.
In terms of matrix algebra, the rotation from the space of the original variables to the PC space is performed by means of the orthogonal loading matrix, L:

$$\mathbf{S}_{N \times V} = \mathbf{X}_{N \times V}\,\mathbf{L}_{V \times V} \qquad (2.6)$$

where S is the score matrix and X is the original matrix, constituted by N objects (rows) described by V variables (columns).
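As a sketch, the scores, loadings, and explained variance of Eqn (2.6) can be obtained, for example, with scikit-learn (an assumed tool, not one cited by the authors; the data matrix here is random stand-in data):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 27))        # stand-in for the red-wine matrix

# Column autoscaling (Eqn 2.1), giving all variables equal a priori weight.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=2)
S = pca.fit_transform(Xs)            # scores: S = X L  (Eqn 2.6)
L = pca.components_.T                # loadings, one column per PC
print(pca.explained_variance_ratio_) # fraction of information per PC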
One of the key features of PCA is its high capability for representing large amounts of complex information by way of simple bidimensional or tridimensional plots.
In fact, the space described by two or three PCs can be used to represent the objects (score plot), the original variables (loading plot), or both objects and variables (biplot) (Geladi et al., 2003; Kjeldahl and Bro, 2010). Since principal components are not inter-correlated variables, no duplicate information is shown in PC plots.
Figure 2.7 represents an example of a highly informative biplot, which derives from PCA performed on the red-wine data set given in Table 2.1. The data have been previously autoscaled, in order to eliminate the magnitude differences among the variables. The two Cartesian axes correspond to the first (meaning lowest-order) two PCs, which together display 44.2% of the information (defined as explained variance) enclosed in the original multidimensional data space.
The plot clearly shows the interrelations existing among samples, among variables, and between samples and variables. Moreover, considering also the additional row information given in Table 2.1 (namely, the class of each sample, graphically represented by different colors), it is possible to get information about the discrimination among the three wine categories (Barolo, Grignolino, and Barbera), the dispersion of samples within each class, and the discriminatory importance of the variables measured.
In particular, it appears that PC1, which accounts for 27.4% of the total variance (i.e., information), is a direction effective in distinguishing among the three wine classes, especially between Barolo (green scores) and Barbera (blue scores) samples. Instead, PC2 (explaining 16.8% of the total variance) is useful in differentiating mainly Grignolino samples (red scores) from the other two groups.
The variables which present the highest absolute loading values on PC1 are the numbers 15, 16, 21, 22, 3, 4, 5, 6, 9, and 17. This means that such variables are the most important in defining PC1 and, consequently, in discriminating among the three wine classes, particularly Barolo from Barbera. In more detail, looking at the correspondences between scores and loadings, it can be deduced from this plot that the samples that present on average the highest values for variables 3, 4, 5, 6, 9, and 17 and the lowest values for variables 15, 16, 21, and 22 belong to class Barbera. Just the opposite considerations apply to the samples of class Barolo (low values for the variables of the first group and high values for the variables of the second group). Instead, Grignolino wines lie in a halfway position, meaning that they are characterized by intermediate values for all these variables. Conversely, variable number 20 has the highest loading value on PC2 and is clearly in the same direction as the Grignolino cluster. This means that Grignolino wines have, on average, high values of variable 20. Opposite considerations are valid for variables 1, 2, 8, 10, 19, 23, and 24.
All of these inferences are fully in accord with the deductions already made by inspection of the radar charts in Fig. 2.6. Moreover, the information that variable 21 (OD280/OD315 of diluted wines) has a high discriminant power is in perfect agreement with the bimodal distribution observed in the histogram of Fig. 2.1b for the same variable.
FIGURE 2.7 Example of PCA biplot for the autoscaled red-wine data. The scores (colored bottles) correspond to the wine samples of classes Barolo (green), Grignolino (red), and Barbera (blue), respectively. The loadings (line segments and numbers from 1 to 27) represent the contribution of the original variables, as listed in Table 2.1, to the information visualized in the plot.
Variables 7, 12, 25, and 27 have very small loading values on both the PCs, meaning that such variables give a negligible contribution to the portion of information visualized in this plot.
Further considerations may be drawn by looking at the distribution of samples inside each class. For instance, it can be easily seen that Barolo wines are characterized by the lowest within-class variability. In the case of quality control, a low sample variability about a target value is an index of high quality (Taguchi, 1986).
It is worth noting that a single and simple biplot is able to report a considerable amount of information that would otherwise require a large number of univariate plots and tests to be extracted: PCA is, without any doubt, the most efficient way to account for the information enclosed in a data table.
2.3.2. Signal Pre-Processing
Instrumental analytical techniques commonly provide information in the form of digital signals. Spectra, chromatograms, and voltammograms are typical examples. Such signals generally require suitable pretreatment, since the analytical information is not their exclusive component. A number of different variations, from sources other than the analytical system under investigation, generally affect signals. They may be related either to the electrical components of the instrumentation or to the surroundings. In particular, unwanted signal variations may be random or systematic. The former are generally due to sporadic interferences or associated with random phenomena (e.g., Brownian motion of particles and thermal motion of electrons, the so-called Johnson–Nyquist noise) which usually follow a standard normal or a Poisson probability distribution. This type of noise, also called white noise, is characterized by frequency values higher than those of the useful signal.
Instead, systematic unwanted variations are commonly due to instrumental trends or to external influences. They may affect the signal with baseline shifts and/or drifts, which can be considered as a low-frequency contribution.
Signal processing is generally aimed at minimizing the unwanted variations, thus improving the quality of signals and, consequently, the conversion of data to valuable information.
In particular, it is possible to identify three main objectives: reduction of random noise, reduction of systematic unwanted variations, and reduction of data size.
p0370Several pre-processing techniques accom-
plish with more than one point. Furthermore,
in some cases, the transformation itself facili-
tates the interpretation of complex signals, as
in the case for derivatives.
When several digital signals are arranged into a data matrix, each of them corresponding to a row (following the chemometric convention), signal pre-processing is also known as row pre-processing: the mathematical transforms act on each single signal, independently of the others.
Techniques for the reduction of random noise include moving-average (or boxcar) filters, Savitzky–Golay smoothing (Savitzky and Golay, 1964), and Fourier transform (FT)-based filters (Reis et al., 2009).
Regarding the elimination or minimization of unwanted systematic effects, a number of mathematical methods for signal transformation are widely employed, such as the standard normal variate (SNV) transform and derivatives.
2.3.2.1. Standard Normal Variate (SNV) Transform
The SNV transform, or row autoscaling, is particularly applied in spectroscopy, since it is useful to correct for both baseline shifts and global intensity variations (Barnes et al., 1989). Each signal (x_i) is row-centered, by subtracting its mean (x̄_i) from each single value (x_{i,v}), and then scaled by dividing by the signal standard deviation (s_i):

$$x_{i,v}' = \frac{x_{i,v} - \bar{x}_i}{s_i} \qquad (2.7)$$
After the transformation, each signal has a mean equal to 0 and a standard deviation equal to 1.
SNV has the peculiarity of possibly shifting informative regions along the signal range, so that interpretation of results referring back to the original signals should be performed with caution (Fearn, 2009).
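As an illustration, the following minimal Python sketch (assuming NumPy is available; the function name snv and the simulated spectra are ours) applies the transform of Eq. (2.7) to every row of a data matrix:

```python
import numpy as np

def snv(X):
    """Standard normal variate (row autoscaling): center and scale each
    signal (row) by its own mean and standard deviation, as in Eq. (2.7)."""
    X = np.asarray(X, dtype=float)
    means = X.mean(axis=1, keepdims=True)
    stds = X.std(axis=1, ddof=1, keepdims=True)
    return (X - means) / stds

# Three simulated "spectra" with different baselines and global intensities
rng = np.random.default_rng(0)
spectra = rng.random((3, 100)) * np.array([[1.0], [2.0], [3.0]]) \
          + np.array([[0.0], [0.5], [1.0]])
corrected = snv(spectra)
print(corrected.mean(axis=1))         # approximately 0 for each row
print(corrected.std(axis=1, ddof=1))  # exactly 1 for each row
```

After the transform, each row has zero mean and unit standard deviation, whatever its original baseline shift and global intensity.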
2.3.2.2. Derivatives
The numerical differentiation of digitized signals may correct for baseline shifts and drifts, depending on the derivative order. Furthermore, derivative profiles often exhibit an increased apparent resolution of overlapping peaks and may accentuate small structural differences between nearly identical signals (Taavitsainen, 2009).
The first derivative of a signal y = f(x) is the rate of change of y with x (i.e., y′ = dy/dx), which can be interpreted, at each single point, as the slope of the line tangent to the signal. It provides a correction for baseline shifts.
The second derivative can be considered as a further derivation of the first derivative (y″ = d²y/dx²); it represents a measure of the curvature of the original signal, i.e., the rate of change of its slope. This transform provides a correction for both baseline shifts and drifts.
A disadvantageous consequence of derivation may be an enhancement of the random noise, which is characterized by high-frequency slope variations. To overcome this hurdle, signals are first smoothed, often by using the Savitzky–Golay algorithm (Savitzky and Golay, 1964) with a third-order polynomial.
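The following sketch, which assumes SciPy's savgol_filter is available (the simulated two-peak signal is purely illustrative), computes smoothed first and second derivatives in a single Savitzky–Golay step:

```python
import numpy as np
from scipy.signal import savgol_filter

# Simulated signal: two overlapping peaks on a linear (drifting) baseline
x = np.linspace(0.0, 10.0, 500)
clean = np.exp(-(x - 4.0) ** 2 / 0.5) + 0.8 * np.exp(-(x - 5.2) ** 2 / 0.5)
noisy = clean + 0.1 * x + np.random.default_rng(1).normal(0.0, 0.01, x.size)

# Smoothed derivatives with a third-order polynomial over a 21-point window
delta = x[1] - x[0]
d1 = savgol_filter(noisy, window_length=21, polyorder=3, deriv=1, delta=delta)
d2 = savgol_filter(noisy, window_length=21, polyorder=3, deriv=2, delta=delta)
# d1 removes constant baseline shifts; d2 also removes the linear drift
```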
2.3.2.3. Horizontal Alignment

When a series of chromatograms is used as the vectors of a data matrix, a typical problem arises from the horizontal shifts that commonly characterize this type of data.
The most common methods for peak alignment, such as correlation optimized warping (COW), search for the maximum correlation between a selected reference profile and a series of piecewise modified (shifted and warped) versions of the signals to be aligned (Nielsen et al., 1998; Jellema, 2009).
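A complete COW implementation is beyond the scope of this chapter; the minimal NumPy sketch below illustrates only the simpler, related idea of rigid alignment, i.e., finding the integer shift that maximizes the correlation with a reference profile (the function name and the wrap-around behavior of np.roll are simplifications of ours):

```python
import numpy as np

def align_by_shift(signal, reference, max_shift=50):
    """Find the integer shift (within +/- max_shift points) that maximizes
    the correlation with the reference, and return the shifted signal.
    np.roll wraps values around the ends, which is ignored here for brevity."""
    best_shift, best_corr = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        corr = np.corrcoef(np.roll(signal, shift), reference)[0, 1]
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return np.roll(signal, best_shift), best_shift

# Example: a chromatographic peak shifted by 7 points from the reference
x = np.linspace(0, 10, 300)
reference = np.exp(-(x - 5.0) ** 2 / 0.1)
shifted = np.roll(reference, 7)
aligned, found = align_by_shift(shifted, reference)
print(found)  # -7
```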
2.3.3. Supervised Data Analysis and Validation
Exploratory techniques for data analysis, such as PCA, are unsupervised, meaning that they simply show the data as they are. Conversely, supervised chemometric methods look for specific features within the data and are explicitly oriented to address particular issues.
In particular, when a model is developed with the purpose of predicting a qualitative or quantitative property of interest, its reliability in prediction should be assessed before the model is used in practice. Prediction ability values should be presented together with their confidence interval (Forina et al., 2001; Forina et al., 2007), which depends on the number of samples used for the validation. The estimation of the predictive ability on new samples, not used for building the models, is a fundamental step in any modeling process, and several procedures have been developed for this purpose. The most common validation strategies divide the available samples into two subsets: a training (or calibration) set, used for computing the model, and an evaluation set, used for assessing its reliability. A key requirement for an honest validation is that the test samples be completely extraneous to the model: no information from them can be used in building the model, not even
in the pre-processing stages; otherwise, the prediction ability may be overestimated.
In many modeling techniques, some parameters are optimized by looking for the setting that provides the maximum predictive ability of the model on a given sample subset.
In such cases, a correct validation strategy involves three sample subsets: a training set, an optimization set, and an evaluation set. The optimization set is used to find the best modeling settings, while the actual reliability of the final model is estimated by a real prediction on the third subset, formed by objects that have never influenced the model.
The evaluation of the predictive ability of a model can be performed in a single step or many times with different evaluation sets, depending on the strategy adopted.
2.3.3.1. Single Evaluation Set
A single evaluation set is the simplest and most rapid validation scheme. A fraction, usually between 50% and 90%, of the available samples constitutes the training set, while the remaining objects form the evaluation set. The subdivision may be arbitrary, random, or performed by way of a uniform design, such as the Kennard–Stone and duplex algorithms (Kennard and Stone, 1969; Snee, 1977), which allow two subsets to be obtained that are uniformly distributed and representative of the total sample variability.
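A minimal sketch of the Kennard–Stone selection, assuming NumPy and an objects-by-variables matrix X (the brute-force distance matrix restricts it to small data sets):

```python
import numpy as np

def kennard_stone(X, n_train):
    """Select n_train samples spanning the data space as uniformly as
    possible (Kennard and Stone, 1969): start from the two most distant
    objects, then repeatedly add the object farthest from those selected."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [i, j]
    while len(selected) < n_train:
        remaining = [k for k in range(len(X)) if k not in selected]
        # Pick the object whose minimum distance to the selected set is largest
        next_idx = max(remaining, key=lambda k: dist[k, selected].min())
        selected.append(next_idx)
    return np.array(selected)

X = np.random.default_rng(2).random((50, 5))
train_idx = kennard_stone(X, 35)                    # 70% training set
eval_idx = np.setdiff1d(np.arange(50), train_idx)   # remaining 30% evaluation
```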
2.3.3.2. Cross-Validation (CV)
Cross-validation is probably the most common validation procedure. The N available samples are divided into G cancellation groups following a predetermined scheme (e.g., contiguous blocks or Venetian blinds). The model is computed G times: each time, one of the cancellation groups is used as the evaluation set, while the other groups constitute the training set. At the end of the procedure, each sample has been used G − 1 times for building a model and once for evaluation. The number of cancellation groups usually ranges from 3 to N. Cross-validation with N cancellation groups is generally known as the leave-one-out (LOO) procedure. LOO has the advantage of being unique for a given data set, whereas, when G < N, different orders of the samples and different subdivision schemes generally supply different outcomes. However, especially when the total number of samples is considerable, predictions made on a single object, although repeated many times, may yield an overly optimistic result. An extensive evaluation strategy consists in performing cross-validation many times, with different numbers of cancellation groups, from 3 up to N. Another possibility is to repeat the validation, for a given number G < N of cancellation groups, each time permuting the order of the samples, thus obtaining a different group composition each time.
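The contiguous-block scheme can be sketched in a few lines of Python (assuming NumPy); with n_groups equal to the number of samples, the same code performs leave-one-out:

```python
import numpy as np

def cross_validation_splits(n_samples, n_groups):
    """Training/evaluation index pairs for G cancellation groups formed as
    contiguous blocks; n_groups == n_samples gives leave-one-out (LOO)."""
    indices = np.arange(n_samples)
    for block in np.array_split(indices, n_groups):
        yield np.setdiff1d(indices, block), block

# Example: 10 samples, 5 cancellation groups of 2 samples each
for train_idx, eval_idx in cross_validation_splits(10, 5):
    print("train:", train_idx, "evaluate:", eval_idx)
```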
2.3.3.3. Repeated Evaluation Set
This procedure, also called Monte Carlo validation, computes many models (often many thousands), each time creating a different evaluation set, with a variable number of samples, by random selection. Each sample may fall many times, or not at all, into the evaluation set. The main drawback of this validation strategy is the longer computational time.
2.3.4. Supervised Qualitative Modeling

2.3.4.1. Classification and Class-Modeling
A large number of issues within food science require qualitative answers. This is the case for the characterization of ingredients or finished products, the verification of geographical origin and, more generally, quality control, the detection of food adulteration, and so on.
Discriminant classification and class-modeling techniques represent the most common chemometric tools for addressing such aims. In fact, they build mathematical rules
or models able to characterize a sample with
respect to a qualitative property, namely the
class to which it belongs.
A class (or category) is defined as a group of samples sharing the same values of discrete variables or similar values of continuous variables. Frequently, such variables are qualitative factors that cannot be determined experimentally, so that they have to be estimated from the values of some experimentally measurable predictors, by way of suitable mathematical tools.
In more detail, discriminant classification techniques determine to which of a number of predefined classes a sample most probably belongs. They work by building delimiters between the classes, and each new object is then always assigned to the category to which it most probably belongs, even when the object does not pertain to any of the classes studied.
Class-modeling techniques, instead, verify whether a sample is compatible with the characteristics of a given class of interest. In fact, they provide an answer to the general question: "Is sample X, claimed to belong to class A, actually compatible with the class A model?". This is essentially the question to be answered in most of the real qualitative problems studied within the food sciences. Such an approach is also capable of detecting outliers (Forina et al., 2008).
2.3.4.2. Evaluation Parameters
The effectiveness of a classification rule is usually evaluated by the classification rate, i.e., the percentage of objects correctly classified. This parameter is often referred to as the prediction rate when it is estimated on an evaluation sample set. A classification rule can be considered valuable when the prediction rate is significantly higher than the null-classification rate, defined as the percentage probability of correct assignments by chance, which corresponds to 100% divided by the number of categories.
A class model is characterized by two parameters: sensitivity and specificity. Sensitivity is the percentage of objects belonging to the modeled class that are correctly accepted by the model. Specificity is the percentage of objects not belonging to the modeled class that are correctly rejected by the model. A class-modeling technique builds a class space whose width corresponds to the confidence interval, at a pre-selected confidence level, for the class objects: sensitivity is an experimental measure of this confidence level. A decrease in the confidence level for the modeled class generally reduces the sensitivity and increases the specificity of the model. Frequently, in order to evaluate the model performance taking both features into account, an efficiency parameter is computed as the geometric mean of sensitivity and specificity.
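Given the acceptance decisions of a class model on an evaluation set, these three parameters can be computed as follows (a minimal NumPy sketch; the function name and the toy data are ours):

```python
import numpy as np

def class_model_performance(accepted, true_class, modeled_class):
    """Sensitivity, specificity, and efficiency (geometric mean) of a class
    model; `accepted` is True where the model accepts the sample."""
    in_class = true_class == modeled_class
    sensitivity = 100.0 * accepted[in_class].mean()      # % class objects accepted
    specificity = 100.0 * (~accepted[~in_class]).mean()  # % non-class objects rejected
    efficiency = np.sqrt(sensitivity * specificity)
    return sensitivity, specificity, efficiency

true_class = np.array(["A"] * 20 + ["B"] * 20)
accepted = np.array([True] * 19 + [False] + [False] * 18 + [True] * 2)
print(class_model_performance(accepted, true_class, "A"))  # (95.0, 90.0, ~92.5)
```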
When at least two classes are modeled, the results of a class-modeling analysis can be visualized by means of Coomans plots (Coomans et al., 1984). Such graphs represent the samples in relation to their distances from the models of two given classes. Often, the distances are normalized by dividing by the critical distance value that characterizes the corresponding model.
In the example given in Fig. 2.8, the two Cartesian axes correspond to the distances from the model of class Barolo and from the model of class Grignolino (red-wine data set), respectively, while two straight lines parallel to the axes describe the limits of the corresponding class spaces at a 95% confidence level. The plot area is divided into four regions, which contain, respectively, the samples accepted by the model of class Barolo (upper left rectangle), the samples accepted by the model of class Grignolino (lower right rectangle), the samples accepted by both models (lower left square), and the objects rejected by both models (upper right square). All the samples belonging to class Barbera correctly lie inside this last region.
Classification and class-modeling techniques belong to three main families:
• distance-based techniques
• probabilistic techniques
• experience-based techniques.
2.3.4.3. Distance-Based Techniques

2.3.4.3.1. k-NEAREST NEIGHBORS (k-NN)
k-NN is one of the simplest approaches to classification (Vandeginste et al., 1998). As a first step, k-NN computes the distances of the test sample from each of the samples of a training set, whose class membership is known. Usually, the multivariate Euclidean distance is employed:
$$D_{i,j} = \sqrt{\sum_{v=1}^{V} \left( x_{i,v} - x_{j,v} \right)^2} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)'(\mathbf{x}_i - \mathbf{x}_j)} \qquad (2.8)$$

where x_i and x_j are the data vectors of samples i and j, respectively, and x_{i,v} is the value of variable v for sample i.

FIGURE 2.8 Example of Coomans plot for the red-wine data set given in Table 2.1. The samples are represented by class-colored symbols: green (Barolo), red (Grignolino), and blue (Barbera).
In some cases, the Mahalanobis distance is used instead. It can be considered as a Euclidean distance modified to take into account the dispersion and the correlation of all of the variables:

$$D_{i,j} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)' \mathbf{V}^{-1} (\mathbf{x}_i - \mathbf{x}_j)} \qquad (2.9)$$

where V is the covariance matrix.
Once the matrix of distances between objects has been computed, the k samples nearest to the test sample are taken into consideration to perform the classification: generally, a majority vote is employed, meaning that the new object is assigned to the class most represented among the k selected objects. Being a distance-based method, k-NN is sensitive to the measurement units and to the scaling procedures applied.
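A minimal k-NN classifier with Euclidean distance (Eq. 2.8) and majority vote can be sketched as follows (assuming NumPy; the simulated two-class data are illustrative only):

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    """Classify x_new by majority vote among its k nearest training
    samples, using the multivariate Euclidean distance of Eq. (2.8)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

rng = np.random.default_rng(3)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_train = np.array(["classA"] * 20 + ["classB"] * 20)
print(knn_classify(np.array([3.5, 4.2]), X_train, y_train, k=5))  # "classB"
```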
The method provides a nonlinear delimiter between categories, generally expressible as a piecewise linear function (see Fig. 2.9). The delimiter usually becomes smoother for higher values of k. When the parameter k is optimized to obtain the highest prediction ability for a given data set, validation should be performed by way of a three-set procedure.
Being a nonprobabilistic method, k-NN is free from statistical assumptions, such as normality of the variable distributions, and from limitations on the number of variables. This ensures a wide applicability. Furthermore, in many applications, it has been shown to perform as well as or better than more complex methods (Vandeginste et al., 1998; Dudoit et al., 2002).
2.3.4.3.2. A NONPARAMETRIC CLASS-MODELING TECHNIQUE
Derde et al. (1986) presented a simple and efficient nonparametric class-modeling technique, closely related to k-NN, which defines the class space on the basis of a critical distance from the objects of the training set (see Fig. 2.10). Several settings can be varied, such as the type of distance (Euclidean or Mahalanobis) and the strategy adopted to determine the critical distance value. Unfortunately, this promising class-modeling technique has fallen out of use, and it would merit a thorough reconsideration.
FIGURE 2.9 Example of k-NN class delimiter for k = 1 (artificial data).

2.3.4.3.3. SOFT INDEPENDENT MODELING OF CLASS ANALOGY (SIMCA)

Soft independent modeling of class analogy (SIMCA) (Wold and Sjöström, 1977) was the
first class-modeling technique introduced into chemometrics. This method builds class models based on PCA performed using only the samples of the category studied, generally after within-class autoscaling or centering. In more detail, SIMCA models are defined by the range of the sample scores on a selected number of low-order principal components (PCs), ideally the significant ones; the models therefore correspond to rectangles (two PCs), parallelepipeds (three PCs), or hyper-parallelepipeds (more than three PCs), referred to as the multidimensional boxes of the SIMCA inner space. Conversely, the principal components not used to describe the model define the outer space, which represents the space of uninformative variations, often due to noise. The score range can be enlarged or reduced, mainly depending on the number of samples, to avoid under- or overestimation of the true variability (Forina and Lanteri, 1984). The standard deviation of the distances of the training objects from the model corresponds to the class standard deviation. The boundaries of the SIMCA space around the model are determined (as shown in Fig. 2.11) by a critical distance, obtained by means of the Fisher statistics. SIMCA is a very flexible technique, since it allows variation of a large number of parameters, such as the scaling or weighting of the original variables, the number of components, and an expanded or contracted score range.
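The core of the method, i.e., a PCA model of a single class and the distance of new objects from it, can be sketched as follows (a simplified NumPy illustration of ours; the F-based critical distance and the score-range adjustments of full SIMCA are omitted):

```python
import numpy as np

def simca_fit(X_class, n_pc=2):
    """Simplified SIMCA model: autoscale within the class, then PCA (via
    SVD) retaining n_pc components; training score ranges define the box."""
    mean = X_class.mean(axis=0)
    std = X_class.std(axis=0, ddof=1)   # assumes no constant variables
    Xs = (X_class - mean) / std
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    P = Vt[:n_pc].T                     # loadings of the inner space
    T = Xs @ P                          # scores of the training objects
    return {"mean": mean, "std": std, "P": P,
            "lo": T.min(axis=0), "hi": T.max(axis=0)}

def simca_distance(model, X_new):
    """Distance of new objects from the model, measured in the outer space
    (residuals left after projection onto the retained PCs)."""
    Xs = (np.atleast_2d(X_new) - model["mean"]) / model["std"]
    T = Xs @ model["P"]
    residuals = Xs - T @ model["P"].T
    return np.sqrt((residuals ** 2).sum(axis=1))
```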
2.3.4.4. Probabilistic Techniques

2.3.4.4.1. LINEAR DISCRIMINANT ANALYSIS (LDA)
FIGURE 2.10 Example of nonparametric class space (shadowed region enclosed by the red line) for the class of interest (artificial data).

Linear discriminant analysis (LDA) is the first multivariate classification technique,
introduced by Fisher (1936). It is a probabilistic parametric technique, i.e., it is based on the estimation of multivariate probability density functions, which are entirely described by a minimum number of parameters: means, variances, and covariances. LDA is based on the hypotheses that the probability density distributions are multivariate normal and that the dispersion is the same for all of the categories. This means that the variance–covariance matrix is the same for all of the categories, while the centroids are different (different locations). In the case of two variables, the probability density function is bell-shaped and its elliptic section lines correspond to equal probability density values and to the same Mahalanobis distance from the centroid. Because of the above-mentioned LDA hypotheses, the ellipses of different categories have equal eccentricity and axis orientation: they differ only in their location in the plane. By connecting the intersection points of each couple of corresponding ellipses, a straight line is identified, which corresponds to the delimiter between the two classes (see Fig. 2.12). For this reason, the technique is called linear discriminant analysis. The directions that maximize the separation between pairs of classes are the so-called canonical variables.
2.3.4.4.2. QUADRATIC DISCRIMINANT ANALYSIS (QDA)

FIGURE 2.11 SIMCA normal-range model (red segment) and class space (shadowed region enclosed by the red line) for the class of interest (artificial data).

Quadratic discriminant analysis (QDA) is a probabilistic parametric classification technique that represents an evolution of LDA for nonlinear class separations. QDA, like LDA, is based on the hypothesis that the
probability density distributions are multivariate normal but, in this case, the dispersion is not the same for all of the categories. It follows that the categories differ not only in the position of their centroids but also in their variance–covariance matrices (different location and dispersion). Consequently, the ellipses of different categories also differ in eccentricity and axis orientation (Geisser, 1964). By connecting the intersection points of each couple of corresponding ellipses (at the same Mahalanobis distance from the respective centroids), a quadratic delimiter is identified, which is a parabola in the bidimensional case, as represented in Fig. 2.13.
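The two hypotheses can be contrasted with a minimal Mahalanobis-distance classifier (a NumPy sketch of ours; it assigns each object to the class with the nearest centroid and ignores the prior-probability and log-determinant terms of the full probabilistic rules):

```python
import numpy as np

def discriminant_classify(x, class_data, pooled=True):
    """Assign x to the class whose centroid is nearest in Mahalanobis
    distance. pooled=True uses a common variance-covariance matrix
    (LDA hypothesis); pooled=False uses per-class matrices (QDA hypothesis)."""
    if pooled:
        # Pooled variance-covariance matrix, weighted by degrees of freedom
        V = sum((len(X) - 1) * np.cov(X, rowvar=False)
                for X in class_data.values())
        V /= sum(len(X) - 1 for X in class_data.values())
    distances = {}
    for label, X in class_data.items():
        Vc = V if pooled else np.cov(X, rowvar=False)
        d = x - X.mean(axis=0)
        distances[label] = float(d @ np.linalg.inv(Vc) @ d)
    return min(distances, key=distances.get)

rng = np.random.default_rng(8)
classes = {"A": rng.normal(0, 1, (30, 2)), "B": rng.normal(3, 2, (30, 2))}
print(discriminant_classify(np.array([2.5, 2.5]), classes, pooled=False))
```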
2.3.4.4.3. UNEQUAL CLASS MODELS (UNEQ)

UNEQ is a powerful class-modeling technique, which originated in the work of Hotelling (1947) and was introduced into chemometrics by Derde and Massart (1986). The method, closely related to QDA, is based on the hypothesis of a multivariate normal distribution in each category studied and on the use of Hotelling's T² statistic to define a class space, whose boundary is an ellipse (two variables), an ellipsoid (three variables), or a hyper-ellipsoid (more than three variables). The dispersion of the class space is defined by the critical value of the T² statistic at a selected confidence level (see Fig. 2.14). The eccentricity and the orientation of the ellipse depend on the correlation between the variables and on their dispersion.
These probabilistic techniques impose some restrictions on the number of objects required. From a strictly mathematical point of view, the number of objects has to exceed the number of measured variables by at least one. Nevertheless, in order to obtain reliable results, these techniques should be applied only when the ratio between the number of objects in a given category and the number of variables is at least three. Furthermore, the number of objects in each class should be nearly balanced: it is not advisable to work with ratios between the numbers of objects in different categories greater than three (Derde and Massart, 1989).

FIGURE 2.12 Iso-probability ellipses under LDA hypotheses and the resultant linear class delimiter (artificial data).
In cases involving many variables, it is possible to apply LDA, QDA, or UNEQ after a preliminary reduction in the number of variables, for instance, by a PCA-based compression.
2.3.4.4.4. POTENTIAL FUNCTIONS METHODS
Potential function techniques were introduced into chemometrics by Coomans and Broeckaert (1986). These methods estimate a probability density distribution as the sum of the contributions of each single sample in a training set. A variety of functions can be used to define the individual contributions. The most commonly used are Gaussian-like functions, with a smoothing coefficient that is formally analogous to the standard deviation of the Gaussian probability function, thus determining the shape of the distribution. Such a coefficient can be the same for all the samples of a given class (fixed potential), or it can be varied as a function of the local density of samples: the latter strategy, known as normal variable potential, is useful when the underlying multivariate distribution is very asymmetric, with regions characterized by a nonuniform density of samples (Forina et al., 1991).
The value of the smoothing coefficient can be optimized by means of a leave-one-out procedure with an optimization sample set.
FIGURE 2.13 Iso-probability ellipses under QDA hypotheses and the resultant quadratic class delimiter (artificial data).
As represented in Fig. 2.15, the resulting estimated overall probability distribution can be very complex and capable of effectively describing nonuniform distributions of samples. From the probability distribution, the boundary of the class space can be obtained at a selected confidence level.
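A fixed-potential estimate with Gaussian-like contributions can be sketched as follows (assuming NumPy; the normalization constant is omitted, and the smoothing coefficient h plays the role of the standard deviation):

```python
import numpy as np

def potential(x_grid, X_train, h=0.5):
    """Fixed-potential estimate: sum (here, mean) of Gaussian-like
    contributions centered on each training sample, with smoothing
    coefficient h common to all samples of the class."""
    X_train = np.atleast_2d(X_train)
    dists2 = ((x_grid[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    contributions = np.exp(-dists2 / (2.0 * h ** 2))
    return contributions.mean(axis=1)

rng = np.random.default_rng(4)
X_class = rng.normal(0, 1, (30, 2))        # training objects of the class
grid = np.array([[0.0, 0.0], [5.0, 5.0]])  # points where the potential is evaluated
print(potential(grid, X_class, h=0.7))     # high near the class, ~0 far away
```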
2.3.5. Supervised Quantitative Modeling
Regression defines mathematical relationships between variables or groups of variables, and provides models for quantitative predictions. Regression techniques can be univariate or multivariate, depending on the number of predictors and, possibly, of response variables involved. Furthermore, they can be linear or nonlinear, depending on the type of relationship they are able to model.
Univariate linear regression is a very common tool in analytical chemistry, generally used to describe the relationship between a chemical quantity (typically, the concentration of an analyte in a series of standards) and a measured physical variable (e.g., the absorbance at a given wavelength). The mathematical model obtained is then used inversely, to compute the chemical quantity in real samples from the values of the physical measurements performed on them.
2.3.5.1. Ordinary Least Squares (OLS)

The ordinary least-squares (OLS), or classical least-squares (CLS), method is probably the most widely used and the most studied historically. It looks for the combination of parameters of the linear model (intercept and slope) that minimizes the sum of the squared residuals (i.e., the squared differences between the values estimated by the model and the corresponding true values). The statistical assumptions underlying the method allow the confidence interval for each predicted value to be calculated.

FIGURE 2.14 UNEQ model (red cross) and class space (shadowed region enclosed by the red line) for the class of interest (artificial data).
OLS can be applied to multivariate data as well, namely when there are two or more predictors. In such cases, the method is also known as multivariate linear regression (MLR) (Draper and Smith, 1981).
The model can be expressed as a mathematical relationship between the response y and the V predictors:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_V x_V \qquad (2.10)$$

that is, in matrix notation:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b} \qquad (2.11)$$

where X is the matrix of the predictors augmented with a column of ones, necessary for the estimation of the intercept, and b is the column vector of the regression coefficients. The regression coefficients are estimated by

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \qquad (2.12)$$

The elements of the vector y are the reference values of the response variable, used for building the model.
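Equation (2.12) translates directly into code (a NumPy sketch; np.linalg.solve is used instead of an explicit matrix inversion for numerical stability):

```python
import numpy as np

def mlr_fit(X, y):
    """Estimate the OLS/MLR coefficients b = (X'X)^-1 X'y of Eq. (2.12),
    after augmenting the predictor matrix with a column of ones."""
    Xa = np.column_stack([np.ones(len(X)), X])
    return np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

def mlr_predict(X, b):
    Xa = np.column_stack([np.ones(len(X)), X])
    return Xa @ b

rng = np.random.default_rng(5)
X = rng.random((30, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.01, 30)
b = mlr_fit(X, y)
print(b)  # approximately [1.0, 2.0, -1.0, 0.5]
```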
The uncertainty on the coefficient estimates varies inversely with the determinant of the information matrix (X′X), which, in the case of a single predictor, corresponds to its variance. In the multivariate case, the determinant depends on the variance of the predictors and on their inter-correlation: a high correlation gives a small determinant of the information matrix, which means a large uncertainty on the coefficients and, consequently, unreliable regression results.
This is the typical situation when vectors corresponding to almost continuous signals (e.g., spectra) are used as predictors. In such cases, contiguous variables are considerably inter-correlated: in a spectrum, for instance, the absorbances at two consecutive wavelengths frequently carry almost the same information, so that their correlation coefficient is nearly 1. In such cases, standard OLS is definitely not recommended.

FIGURE 2.15 Potential functions class space (shadowed region enclosed by the red line) for the class of interest (artificial data).
Furthermore, the number of objects required for OLS regression must be at least equal to the number of predictors plus one, a condition rarely satisfied in practice.
2.3.5.2. Principal-Component Regression (PCR)
Principal-component analysis offers a very simple approach to overcome these hurdles. The model is obtained by a classical least-squares approach that uses, as predictors, a reduced number of significant principal components computed from the original variables (Jolliffe, 1982). The PCs are, by definition, orthogonal and, therefore, uncorrelated. This technique, which is very efficient in many cases, is known as principal-component regression (PCR). Since the directions that explain the highest amount of variance (i.e., the lowest-order PCs) are not always the most important for predicting a response variable, a refined approach can be followed, which performs a stepwise selection of the principal components to be used in the model on the basis of their modeling efficiency.
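A minimal PCR sketch (assuming NumPy), in which the scores of the retained PCs replace the original, inter-correlated predictors in the least-squares step:

```python
import numpy as np

def pcr_fit(X, y, n_pc=3):
    """Principal-component regression: center the data, compute the first
    n_pc PCs by SVD, then regress y on the (orthogonal) scores by OLS."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pc].T                  # loadings of the retained PCs
    T = Xc @ P                       # scores: uncorrelated predictors
    q = np.linalg.solve(T.T @ T, T.T @ (y - y_mean))
    return {"x_mean": x_mean, "y_mean": y_mean, "P": P, "q": q}

def pcr_predict(model, X_new):
    T = (X_new - model["x_mean"]) @ model["P"]
    return model["y_mean"] + T @ model["q"]
```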
2.3.5.3. Partial Least Squares (PLS)
PLS (partial least squares or, also, projection onto latent structures) is probably the most widely used multivariate regression technique (Wold et al., 2001) and represents a better solution to both the problem of variable number and that of inter-correlation. The latent structures, more frequently called latent variables (LVs) or PLS components, are directions in the space of the predictors. In particular, the first latent variable is the direction characterized by the maximum covariance with the selected response variable. The information related to the first latent variable is then subtracted from both the original predictors and the response. The second latent variable is orthogonal to the first one, being the direction of maximum covariance between the residuals of the predictors and the residuals of the response. This approach continues for the subsequent LVs.
The optimal complexity of the PLS model, i.e., the most appropriate number of latent variables, is determined by evaluating, with a proper validation strategy, the prediction error of models of increasing complexity. The parameters usually considered are the standard deviation of the error of calibration (SDEC), computed with the objects used for building the model, and the standard deviation of the error of prediction (SDEP), computed with objects not used for building the model:

$$\mathrm{SDEC(P)} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}} \qquad (2.13)$$

where y_i is the value of the response variable y for sample i, ŷ_i is the corresponding value computed or predicted by the model, and N is the number of samples.
In general, the calibration error always decreases as the number of LVs increases, because the fit improves (toward overfitting). The prediction error, on the contrary, generally decreases up to a certain model complexity and then rises: this indicates that the additional LVs are bringing in noise, as shown in the example given in Fig. 2.16. A simple and practical criterion is to choose, as the optimal model complexity, the LV number corresponding to the absolute minimum of the SDEP or, better, to its first local minimum. In the example of Fig. 2.16, this corresponds to six LVs.
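Assuming scikit-learn is available, the SDEP curve can be estimated by cross-validation as follows (a sketch of ours built on the standard PLSRegression and cross_val_predict tools; the toy data are illustrative only):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def sdep_by_lv(X, y, max_lv=10, cv=5):
    """SDEP of PLS models of increasing complexity, estimated by
    cross-validation; the first local minimum suggests the optimal LV number."""
    sdep = []
    for lv in range(1, max_lv + 1):
        y_pred = cross_val_predict(PLSRegression(n_components=lv), X, y, cv=cv)
        sdep.append(np.sqrt(np.mean((y - y_pred.ravel()) ** 2)))
    return np.array(sdep)

rng = np.random.default_rng(7)
X = rng.random((40, 20))
y = X[:, :3].sum(axis=1) + rng.normal(0, 0.05, 40)
sdep = sdep_by_lv(X, y)
print(np.argmin(sdep) + 1)  # LV number at the SDEP minimum
```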
When the number of noisy (noninformative) variables is too large, the performance of PLS models may be improved by a prior selection of useful predictors (Forina et al., 2007).
A number of PLS variants have been developed, for instance, for building nonlinear models and for predicting two or more response variables together (PLS-2). Furthermore, when category indices are taken as dummy response variables, PLS may work as a classification method, usually called PLS discriminant analysis (PLS-DA) or discriminant PLS (D-PLS).
2.3.6. Artificial Neural Networks
Artificial neural networks (ANNs) are a family of versatile nonparametric tools that can be employed both for data exploration and for qualitative and quantitative predictive modeling. ANNs offer some advantages: for instance, they are generally well suited to nonlinear problems, and the related software is easily available. Conversely, a number of important drawbacks should restrict ANN use to cases in which simpler techniques fail and, primarily, in which a large number of samples is available.
Multilayer feedforward (MLF) neural networks represent the configuration of ANNs most widely applied to electronic tongue data. An illustrative scheme is shown in Fig. 2.17.
MLF networks are composed of a number of computational elements, called neurons, generally organized in three layers (Zupan, 1994).

FIGURE 2.16 Example of the typical trends of the calibration and prediction errors as the complexity of the PLS model (number of latent variables) increases.

FIGURE 2.17 Illustrative scheme of the general inter-neuronal connections and transmission/correction mechanisms for a multilayer feedforward neural network.

In the first layer, the input layer, there are usually N neurons, which correspond to the original predictors. The predictors are scaled (generally range scaled). When their number is very large, the principal components are often used instead, in order to reduce the data size and the computational time.
The first layer transmits the values of the predictors to the second, hidden, layer. All the neurons of the input layer are connected to the J neurons of the hidden layer by means of weight coefficients, so that each of the J elements of the hidden layer receives, as information, a weighted sum S of the values from the input layer. These neurons transform the information received (S) by means of a suitable transfer function, frequently a sigmoid.
The hidden neurons then transmit information to the third, output, layer as a weighted combination (Z) of their values. The neurons in the output layer correspond to the response variables which, in the case of classification, are the coded class indices. The output neurons transform the information Z from the hidden layer by means of a further sigmoid or semilinear function.
After a first random initialization of the weights, a learning procedure modifies the weights w_{n,j} and w_{j,o} during several optimization cycles, in order to improve the performance of the net. The correction of the weights at each step is proportional to the prediction error of the previous cycle. The optimization of many parameters and the large number of learning cycles considerably increase the risk of overfitting; for this reason, a thorough validation with a substantial number of objects is required.
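The forward (transmission) step of such a network can be sketched as follows (a NumPy illustration with random, untrained weights; bias terms and the learning procedure are omitted):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def mlf_forward(x, W_hidden, W_output):
    """Forward pass of a three-layer MLF network: the hidden neurons receive
    a weighted sum S of the inputs, the output neurons a weighted sum Z of
    the hidden values; both apply a sigmoid transfer function."""
    h = sigmoid(W_hidden @ x)     # hidden layer: S -> sigmoid(S)
    return sigmoid(W_output @ h)  # output layer: Z -> sigmoid(Z)

rng = np.random.default_rng(6)
n_inputs, n_hidden, n_outputs = 4, 3, 2
W_hidden = rng.normal(0, 0.5, (n_hidden, n_inputs))   # weights w_{n,j}
W_output = rng.normal(0, 0.5, (n_outputs, n_hidden))  # weights w_{j,o}
x = rng.random(n_inputs)  # range-scaled predictors of one sample
print(mlf_forward(x, W_hidden, W_output))
```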
Other widely employed types of ANNs are the Kohonen self-organizing maps (SOMs), used for unsupervised exploratory analysis, and the counterpropagation (CP) neural networks, used for nonlinear regression and classification (Kohonen, 2001). These tools, too, require a considerable number of objects to build reliable models, as well as a severe validation.
Acknowledgment

The authors wish to thank Mr. Patrick Guerin for the careful revision of the manuscript.
References
American Oil Chemist's Society, 1998. Official Methods and Recommended Practices of the American Oil Chemists' Society, fifth ed. AOCS, Champaign, IL.
Barnes, R.J., Dhanoa, M.S., Lister, S.J., 1989. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Appl. Spectrosc. 43 (5), 772–777.
Box, G.E.P., Hunter, W.G., Hunter, J.S., 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York.
Chambers, J.M., Cleveland, W.S., Kleiner, B., Tukey, P.A., 1983. Graphical Methods for Data Analysis. Chapman & Hall, New York.
Coomans, D., Broeckaert, I., Derde, M.P., Tassin, A., Massart, D.L., Wold, S., 1984. Use of a microcomputer for the definition of multivariate confidence regions in medical diagnosis based on clinical laboratory profiles. Comput. Biomed. Res. 17, 1–14.
Coomans, D., Broeckaert, I., 1986. Potential Pattern Recognition in Chemical and Medical Decision Making. Research Studies Press, Letchworth, England.
Derde, M.P., Kaufman, L., Massart, D.L., 1986. A non-parametric class modelling technique. J. Chemom. 3, 375–395.
Derde, M.P., Massart, D.L., 1986. UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution. Anal. Chim. Acta 184, 33–51.
Derde, M.P., Massart, D.L., 1989. Evaluation of the required sample size in some supervised pattern recognition techniques. Anal. Chim. Acta 223, 19–44.
Dudoit, S., Fridlyand, J., Speed, T.P., 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97, 77–87.
Draper, N.R., Smith, H., 1981. Applied Regression Analysis, second ed. Wiley, New York.
Fearn, T., 2009. The effect of spectral pre-treatments on interpretation. NIR News 20, 16–17.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188.
Forina, M., Lanteri, S., 1984. Chemometrics: mathematics and statistics in chemistry. In: Kowalski, B.R. (Ed.), NATO ASI Series, Ser. C, vol. 138. Reidel Publ. Co., Dordrecht, pp. 439–466.
Forina, M., Armanino, C., Castino, M., Ubigli, M., 1986. Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201.
Forina, M., Armanino, C., Leardi, R., Drava, G., 1991. A class-modelling technique based on potential functions. J. Chemom. 5, 435–453.
Forina, M., Lanteri, S., Rosso, S., 2001. Confidence intervals of the prediction ability and performance scores of classification methods. Chemometr. Intell. Lab. Syst. 57, 121–132.
Forina, M., Lanteri, S., Casale, M., 2007. Multivariate calibration. J. Chromatogr. A 1158, 61–93.
Forina, M., Oliveri, P., Lanteri, S., Casale, M., 2008. Class-modeling techniques, classic and new, for old and new problems. Chemometr. Intell. Lab. Syst. 93, 132–148.
Geisser, S., 1964. Posterior odds for multivariate normal distributions. J. R. Stat. Soc. Ser. B (Methodological) 26, 69–76.
Geladi, P., Manley, M., Lestander, T., 2003. Scatter plotting in multivariate data analysis. J. Chemom. 17, 503–511.
Hotelling, H., 1947. Multivariate quality control. In: Eisenhart, C., Hastay, M.W., Wallis, W.A. (Eds.), Techniques of Statistical Analysis. McGraw-Hill, New York, pp. 111–184.
Iman, R.L., 1982. Graphs for use with the Lilliefors test for normal and exponential distributions. Amer. Statist. 36, 109–112.
Jellema, R.H., 2009. Variable shift and alignment. In: Brown, S.D., Tauler, R., Walczak, B. (Eds.), Comprehensive Chemometrics, vol. 2. Elsevier, Amsterdam, pp. 85–108.
Jolliffe, I.T., 1982. A note on the use of principal components in regression. J. R. Stat. Soc. Ser. C (Appl. Stat.) 31 (3), 300–303.
Jolliffe, I.T., 2002. Principal Component Analysis, second ed. Springer, New York, pp. 201–207.
Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics 11, 137–148.
Kjeldahl, K., Bro, R., 2010. Some common misunderstandings in chemometrics. J. Chemom. 24, 558–564.
Kolmogorov, A., 1933. Sulla determinazione empirica di una legge di distribuzione. G. Inst. Ital. Attuari 4, 83–91.
Kohonen, T., 2001. Self-Organizing Maps, third ed. Springer, New York, NY.
Lilliefors, H.W., 1970. On the Kolmogorov–Smirnov test for normality with mean and variance unknown. J. Amer. Stat. Assoc. 62, 399–405.
Martens, H., Kohler, A., 2008. Bio-spectroscopy and bio-chemometrics: high-throughput metabolic profiling for integrative genetics. In: Proceedings of the Metabomeeting 2008 Conference, 28–29 April 2008. École Normale Supérieure de Lyon, Lyon, France, p. 18.
Nielsen, N.-P.V., Carstensen, J.M., Smedsgaard, J., 1998. Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. J. Chromatogr. A 805, 17–35.
Oliveri, P., Casale, M., Casolino, M.C., Baldo, M.A., Nizzi Grifi, F., Forina, M. A comparison between classical and innovative class-modelling techniques for the characterisation of a PDO olive oil. Anal. Bioanal. Chem., in press. doi:10.1007/s00216-010-4377-1.
Pearson, K., 1901. On lines and planes of closest fit to systems of points in space. Philos. Mag. 2 (6), 559–572.
Reis, M.S., Saraiva, P.M., Bakshi, B.R., 2009. Denoising and signal-to-noise ratio enhancement: wavelet transform and Fourier transform. In: Brown, S.D., Tauler, R., Walczak, B. (Eds.), Comprehensive Chemometrics, vol. 2. Elsevier, Amsterdam, pp. 25–55.
Savitzky, A., Golay, M.J.E., 1964. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 1627–1639.
Sharoba, A.M., Senge, B., El-Mansy, H.A., Bahlol, H.ElM., Blochwitz, R., 2005. Chemical, sensory and rheological properties of some commercial German and Egyptian tomato ketchups. Eur. Food Res. Technol. 220, 142–151.
Smirnov, N.V., 1939. On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. Univ. Moscow 2, 3–14.
Snedecor, G.W., Cochran, W.G., 1989. Statistical Methods, eighth ed. Iowa State University Press, Ames, IA.
Snee, R., 1977. Validation of regression models: methods and examples. Technometrics 19, 415–428.
Student, 1908. The probable error of a mean. Biometrika 6, 1–25.
Taavitsainen, V.M., 2009. Denoising and signal-to-noise ratio enhancement: derivatives. In: Brown, S.D., Tauler, R., Walczak, B. (Eds.), Comprehensive Chemometrics, vol. 2. Elsevier, Amsterdam, pp. 57–66.
Taguchi, G., 1986. Introduction to Quality Engineering: Designing Quality into Products and Processes. Asian Productivity Organization, ASI Press, Dearborn, MI.
Valcárcel, M., Cárdenas, S., 2005. Vanguard–rearguard analytical strategies. Trends Anal. Chem. 24, 67–74.
Vandeginste, B.G.M., Massart, D.L., Buydens, L.M.C., De Jong, S., Lewi, P.J., Smeyers-Verbeke, J., 1998. Handbook of Chemometrics and Qualimetrics, vol. 20B. Elsevier, Amsterdam.
Wold, S., 1972. Spline functions, a new tool in data-analysis. Kem. Tidskr. 84, 34–37.
Wold, S., Sjöström, M., 1977. SIMCA: a method for analysing chemical data in terms of similarity and analogy. In: Kowalski, B.R. (Ed.), Chemometrics: Theory and Applications, ACS Symposium Series 52. American Chemical Society, Washington, DC, pp. 243–282.
Wold, S., Sjöström, M., Eriksson, L., 2001. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 58, 109–130.
Zupan, J., 1991. Introduction to artificial neural network (ANN) methods: what they are and how to use them. Acta Chim. Slov. 41, 327–352.
Abstract

Data mining is usually the last, but not for this less important, step of any food analysis process. It rather represents a critical phase: in fact, proper data processing allows the extraction of useful information about the system under study from large amounts of collected data, and obtaining information is usually the main objective in analytical chemistry. The classical univariate approach, which considers one variable at a time, underutilizes the global data structure and offers just a partial image of it. Instead, multivariate strategies allow a more complete interpretation of the data and exploitation of the information contained therein. Multivariate techniques can be used both for exploratory purposes and for qualitative or quantitative modeling. Generally, modeling is performed for predictive applications: in such cases, a thorough model validation is always required.

Keywords: multivariate data analysis; chemometrics; pattern recognition; modeling; validation
Keywords: multivariate data analysis; chemometrics; pattern recognition; modeling; validation