Electronic copy available at: http://ssrn.com/abstract=1107815
Measures of fit in multiple correspondence analysis of crisp and
fuzzy coded data
Zerrin Aşan1 and Michael Greenacre2
1Department of Statistics, Anadolu University, Eskişehir, Turkey
2Department of Economics and Business, Universitat Pompeu Fabra, Barcelona, Spain
When continuous data are coded to categorical variables, two types of coding are possible: crisp
coding in the form of indicator, or dummy, variables with values either 0 or 1; or fuzzy coding where
each observation is transformed to a set of “degrees of membership” between 0 and 1, using so-called membership functions.
It is well known that the correspondence analysis of crisp coded data, namely multiple correspondence
analysis, yields principal inertias (eigenvalues) that considerably underestimate the quality of the
solution in a low-dimensional space. Since the crisp data only code the categories to which each
individual case belongs, an alternative measure of fit is simply to count how well these categories are
predicted by the solution. Another approach is to consider multiple correspondence analysis
equivalently as the analysis of the Burt matrix (i.e., the matrix of all two-way cross-tabulations of the
categorical variables), and then perform a joint correspondence analysis to fit just the off-diagonal
tables of the Burt matrix – the measure of fit is then computed as the quality of explaining these tables.
The correspondence analysis of fuzzy coded data, called “fuzzy multiple correspondence analysis”,
suffers from the same problem, albeit attenuated. Again, one can count how many correct predictions
are made of the categories which have highest degree of membership. But here one can also defuzzify
the results of the analysis to obtain estimated values of the original data, and then calculate a measure
of fit in the familiar percentage form, thanks to the resultant orthogonal decomposition of variance.
Furthermore, if one thinks of fuzzy multiple correspondence analysis as explaining the two-way
associations between variables, a fuzzy Burt matrix can be computed and the same strategy as in the
crisp case can be applied to analyse the off-diagonal part of this matrix.
In this paper these alternative measures of fit are defined and applied to a data set of continuous
meteorological variables, which are coded crisply and fuzzily into three categories. Measuring the fit
is further discussed when the data set consists of a mixture of discrete and continuous variables.
Key words: data coding, defuzzification, fuzzy coding, indicator matrix, joint correspondence
analysis, measure of fit, multiple correspondence analysis, Burt matrix.
The first author thanks Sevil Sentürk for assistance and useful comments. The second author thanks
the Fundación BBVA for financial support in this research, as well as the Spanish Ministry of
Education and Science for partial support in the form of grant MEC-SEJ2006-14098.
Multiple correspondence analysis (MCA) is the correspondence analysis (CA) of a data set of
categorical variables that are coded as zero-one (dummy) variables in an indicator matrix (also called
“logical coding”) or as a matrix composed of all two-way contingency tables called a “Burt matrix”
(see, for example, Benzécri (1973) or, for a recent account, Greenacre and Blasius (2006)). In CA the
eigenvalues, or principal inertias, expressed as percentages relative to their total, are used as a measure
of fit, but in MCA it is well known that these percentages give pessimistic estimates of the quality of
the solution. For example, Lebart (1984, 2006) states that in MCA the “percentages of variance are
misleading measures of information”. The main issue here is to consider the type of data entering the
CA algorithm and whether it makes sense to measure the fit in terms of the usual reconstruction of
these data by the low-dimensional solution. In the case of the input indicator matrix, it seems obvious
that we are not interested in an exact reconstruction of the zeros and ones in the data, so to measure the
fit in this way makes little sense. An alternative that seems more appropriate to the discrete zero-one
data would be to measure how well the MCA solution predicts the categories to which each case
belongs. In the case of the input Burt matrix, the reason for the low percentages is clear when we
consider what is included in “total inertia”, which constitutes the denominator in calculating the
measure of fit. Down the diagonal of the Burt matrix are cross-tabulations of each variable with itself,
inflating the total inertia by amounts that are, firstly, of no interest and, secondly, impossible to
explain in a low-dimensional solution. Greenacre (1988, 1991, 2007: chap. 19) presented an
alternative way of performing CA in this situation, leading to much higher percentages of inertia,
because only the “interesting” inertia in the off-diagonal tables of the Burt matrix is being explained.
This approach, called joint correspondence analysis (JCA), contains simple CA (i.e., with two
variables) as an exact special case, and so provides a more natural generalization of the bivariate
problem to the multivariate one.
The present work extends these ideas to the CA of a fuzzy-coded matrix, which is a generalization of
the “crisp” zero-one indicator matrix that allows the coded values to be “fuzzy”, i.e., to be real
numbers between 0 and 1 (inclusive) for each category, while maintaining the property that the coded
values for each variable sum to 1. To our knowledge, fuzzy coding was introduced into the French
literature on CA in 1977 by Guitonneau and Roux (1977), while van Rijckevorsel (1988) attributes the
idea originally to the doctoral thesis of Bordet (1973). The idea has since entered various fields
of application of multivariate methods; see, for example, Chevenet (1994) for an application of fuzzy
coding in an ecological context.
We use the term “fuzzy MCA” for the CA of the fuzzy coded matrix, and we will show that the
problem of pessimistic percentages of explained inertia also occurs in fuzzy MCA, albeit attenuated.
Various alternative measures of fit will be discussed – these all measure how the CA solution recovers
the data in different ways. The standard way is to measure fit to the input data, whatever these data
may be. The alternatives look at the nature of the input data and define the measure of fit in terms of
recovering what is considered to be the “interesting” part of the data. We will be concerned mainly
with the following three alternatives:
1. Suppose we are interested in simply predicting the categories to which each case belongs or for
which it has highest degree of membership, rather than the exact values 1 and 0 (crisp coding) or
the degrees of membership themselves (fuzzy coding). Then the measure of fit can be calculated
as the percentage of categories correctly predicted (Sections 4.1 and 4.2).
2. Suppose we are interested in predicting the original continuous variables. We can estimate the
continuous variables from the MCA solution, which is inherently fuzzy in both crisp and fuzzy
analyses. Then a measure of fit is calculated between these estimates and the original data
(Sections 4.3 and 4.4).
3. Suppose we are interested in the two-way relationships summarized in the Burt matrix, based on
crisp or fuzzy coded data. Then we can use JCA, which already exists for crisp MCA, to explain the
off-diagonal tables which cross-tabulate distinct pairs of variables, ignoring the cross-tabulations
of each variable with itself (Sections 4.5 and 4.6).
The proposed approaches will be illustrated using meteorological data from 40 cities of Turkey. We
first describe the coding of these data in Section 2 and various “standard” results of analysing these
data in Section 3, using principal component analysis (PCA), MCA and fuzzy MCA. Then we will
describe the properties of each of the above measures of fit in Section 4, with a discussion in Section 5
where we also give some guidelines as to the circumstances in which one measure is preferable to
the others, and some proposals for the analysis of mixed continuous–categorical data.
2. Continuous data and their recoding for CA
Table 1 contains average values of five meteorological variables for 40 cities of Turkey, based on
measurements taken in 2004 (Turkey Statistical Yearbook, 2004) – note that we use this example as a
convenient illustration of our approach rather than a substantive meteorological application.
There are several standard ways in the CA literature to recode continuous data in a form suitable for
CA, for example re-expressing each variable relative to its range, or replacing data values by their
ranks (see Greenacre, 2007: chapter 23); here we consider the discretization of the continuous scales
into a set of categorical ones. In other words, suppose we define three categories for each variable,
which we could call “low” (1), “middle” (2) and “high” (3), then we can replace each value by its
category number and use MCA to analyse the data. For each variable two boundary points need to be
chosen in order to divide the scale into three intervals. These boundary points can be chosen in
different ways, and thanks to the principle of distributional equivalence in CA (see, for example,
Greenacre, 2007: pages 37–38), the particular choice is not so critical. Clearly this recoding of the
data loses a lot of information, so an alternative fuzzy coding can be performed which conserves more
of the information in the data while still maintaining the low-middle-high idea. Again, there are
several ways of coding the data into fuzzy categories, called fuzzification (as opposed to
discretization), described in Loslever and Bouilland (1999) and Murtagh (2005), for example. We
have chosen the system of so-called “three-point triangular membership functions” shown in Figure 1
(also called piecewise linear functions, or second order B-splines – see van Rijckevorsel (1988) for a
theoretical account of this topic). The three values of the minimum, median and maximum of the
variable are used as so-called “hinge” points to define triangles which convert the continuous scale
into values on a 0 to 1 scale such that the membership values sum to 1. The first and last triangles are
“shouldered”, with maxima of 1 at the continuous variable’s minimum and maximum values
respectively, while the middle triangle has a maximum of 1 at the median. The definition of the set of
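triangular membership functions is thus as follows, for the input continuous variable x which has
minimum value m1, median m2 and maximum value m3:
$$
z_1(x) = \begin{cases} \dfrac{m_2 - x}{m_2 - m_1}, & x \le m_2 \\ 0, & \text{otherwise} \end{cases}
\qquad
z_2(x) = \begin{cases} \dfrac{x - m_1}{m_2 - m_1}, & x \le m_2 \\ \dfrac{m_3 - x}{m_3 - m_2}, & x > m_2 \end{cases}
\qquad
z_3(x) = \begin{cases} \dfrac{x - m_2}{m_3 - m_2}, & x > m_2 \\ 0, & \text{otherwise} \end{cases}
\tag{1}
$$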
These functions give the three degrees of membership that code the variable x into the three fuzzy
categories. Notice that a particular input value will belong to no more than two fuzzy categories, and
that the membership degrees sum to 1. Alternatives to triangular membership functions can be
trapezoidal, Gaussian and generalized bell (or Cauchy) membership functions, which have various
theoretical advantages (see, for example, Jang, Sun and Mizutani, 1997; Senturk, 2006), and we could
also consider more than three fuzzy categories. These specific aspects about fuzzy coding have been
dealt with extensively in the literature – see, for example, Zhou, Purvis and Kazabov (1997) and
Verkuilen (2005) for a discussion of choice of membership functions (the latter reference gives a more
statistical treatment of the subject and shows connections to psychometric scaling). Here we have
chosen one of the simplest forms of coding for illustrative purposes, since our purpose is to deal with
properties of the CA of such data.
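As an illustration of the three-point triangular coding described above, the following is a minimal Python sketch, not the authors' implementation. The function name `fuzzy_code` is hypothetical; the hinge values used in the example are the minimum, median and maximum hours of sunshine (4.14, 7.10 and 8.33) and the sunshine value for Adana (7.55) quoted later in the paper.

```python
import numpy as np

def fuzzy_code(x, m1, m2, m3):
    """Three-point triangular fuzzy coding with hinges m1 < m2 < m3.

    Returns the degrees of membership in the "low", "middle" and "high"
    categories; they are non-negative and sum to 1.
    (Hypothetical helper written for illustration.)
    """
    x = np.asarray(x, dtype=float)
    low = np.clip((m2 - x) / (m2 - m1), 0.0, 1.0)   # 1 at the minimum, 0 from the median upwards
    high = np.clip((x - m2) / (m3 - m2), 0.0, 1.0)  # 0 up to the median, 1 at the maximum
    mid = 1.0 - low - high                          # peaks at 1 on the median
    return np.stack([low, mid, high], axis=-1)

# Sunshine in Adana: value 7.55, hinges 4.14 (min), 7.10 (median), 8.33 (max)
adana_sun = fuzzy_code(7.55, 4.14, 7.10, 8.33)
print(np.round(adana_sun, 3))  # approximately [0, 0.634, 0.366]
```

Computing the middle membership as the complement of the other two guarantees that the three values sum to 1, and at most two of them are non-zero for any input.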
The fuzzy coded data are given in Table 2. To compare this with crisp coded indicator data, we
constructed a table consisting of zero-one dummies where two out of the three categories are zeros and
the value of one is assigned to the category with highest membership value according to Table 2 – that
is, the boldface values in Table 2 are replaced by 1s and the rest 0s.
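The conversion from fuzzy to crisp coding just described can be sketched as follows. This is an illustrative Python fragment with a hypothetical function name (`crisp_from_fuzzy`); the first row is the fuzzy coding of sunshine in Adana quoted later in the paper, and the second row is an invented example. Ties, which the text does not discuss, are resolved here in favour of the first category, and that choice is an assumption.

```python
import numpy as np

def crisp_from_fuzzy(memberships):
    """Replace the highest membership in each row by 1 and the rest by 0.

    `memberships` holds the fuzzy categories of ONE variable, one row per case.
    (Hypothetical helper; ties go to the first category with the maximum.)
    """
    m = np.asarray(memberships, dtype=float)
    crisp = np.zeros(m.shape, dtype=int)
    crisp[np.arange(m.shape[0]), m.argmax(axis=1)] = 1
    return crisp

fuzzy_rows = np.array([
    [0.0, 0.634, 0.366],   # sunshine in Adana (from the paper)
    [0.8, 0.2, 0.0],       # invented row, for illustration only
])
print(crisp_from_fuzzy(fuzzy_rows))
```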
3. PCA, MCA and fuzzy MCA
We analyse the three forms of the data in Tables 1, 2 and the crisp data to make an initial comparison
of their results, focusing on the value and interpretation of the standard measure of fit in each case.
3.1 Principal component analysis (PCA)
Figure 2 shows the principal component analysis (PCA) of Table 1, where the data have been
standardized, and where the map has been scaled to be a biplot (see, for example, Gower and Hand,
1996). The map explains 75.6% of the variance, but what exactly does this measure of fit mean?
Taking the view of PCA as a method of matrix approximation, we can think of Figure 2 as a way of
approximating the standardized data. Suppose that the original data are denoted by the 40×5 matrix X
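= [xij], and the standardized data by Y = [yij], with elements $y_{ij} = (x_{ij} - \bar{x}_j)/s_j$, where $\bar{x}_j$ and $s_j$ are
the mean and standard deviation respectively of the j-th column variable. Then the PCA of Y is based
on the singular-value decomposition (SVD):
$$ \frac{1}{\sqrt{n}}\, Y = U D_\alpha V^T \tag{2} $$
where n = 40. The biplot of Figure 2 is constructed using the first two columns of $U D_\alpha$ and $\sqrt{n}\,V$,
denoted by F (40×2) and Γ (5×2) respectively, so that approximated standardized values $\hat{y}_{ij}$ can be
computed by calculating all 200 scalar products between the two sets of points: $\hat{y}_{ij} = f_i^T \gamma_j$,
i=1,…,40; j=1,…,5 (where $f_i$ and $\gamma_j$ denote the i-th row of F and the j-th row of Γ), i.e. $\hat{Y} = F\Gamma^T$.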
This is usually thought of geometrically as a projection of the row
(city) points onto the vectors defined by the variables, with appropriate calibration of the axes which
depends on the length of these vectors (Gabriel and Odoroff, 1990). The total variance of the matrix Y,
with value 5 in this case because there are five variables each with variance 1, can be decomposed into
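two parts, thanks to the orthogonality between the solution estimates $\hat{y}_{ij}$ and the errors $e_{ij} = y_{ij} - \hat{y}_{ij}$:
$$ \frac{1}{n}\sum_i\sum_j y_{ij}^2 = \frac{1}{n}\sum_i\sum_j \hat{y}_{ij}^2 + \frac{1}{n}\sum_i\sum_j e_{ij}^2 \tag{3} $$
i.e., in our example 5 = 3.782 + 1.218; hence the figure of 75.6% is the explained part (3.782) of the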
decomposition relative to the total, whereas the unexplained, or error, part (1.218) is 24.4% of the
total. Literally, 75.6% tells us how close the map is to the standardized data in the least-squares sense.
Putting this another way by using an analogy with regression analysis, the two principal axes of the
map are orthogonal latent variables predicting the data in Y, the first axis explaining 43.3% of the
variance, and the second an additional 32.3%, that is a coefficient of determination R² = 0.756. The
explained variance of 3.782 can be further decomposed into parts for the five variables in order to see
which variables are explained better than others: 3.782 = 0.831 + 0.793 + 0.620 + 0.801 + 0.737 (for
perfect explanation of a variable, its explained part would be 1).
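The decomposition of variance described in this section can be checked numerically. The sketch below is illustrative only: it uses synthetic data (not the Turkish meteorological data of Table 1), assumes standardization with divisor n so that each variable has variance exactly 1, and assumes the SVD is applied to Y/√n, which is consistent with the biplot coordinates UDα and √nV given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a 40 x 5 data matrix (NOT the data of Table 1)
X = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 5))
n, p = X.shape

# Standardize with divisor n, so the total variance equals p
Y = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of Y / sqrt(n); the squared singular values are the variances on the principal axes
U, alpha, Vt = np.linalg.svd(Y / np.sqrt(n), full_matrices=False)

F = U[:, :2] * alpha[:2]        # row coordinates: first two columns of U D_alpha
G = np.sqrt(n) * Vt[:2].T       # column coordinates: first two columns of sqrt(n) V
Y_hat = F @ G.T                 # scalar products reconstruct the standardized data
E = Y - Y_hat                   # errors, orthogonal to the estimates

total = (Y ** 2).sum() / n
explained = (Y_hat ** 2).sum() / n
unexplained = (E ** 2).sum() / n
per_variable = (Y_hat ** 2).sum(axis=0) / n   # explained part of each variable, at most 1

print(f"total = {total:.3f}, explained = {explained:.3f}, unexplained = {unexplained:.3f}")
print("percentage explained:", round(100 * explained / total, 1))
```

The explained part equals the sum of the first two squared singular values, and the per-variable parts add up to it, mirroring the breakdown of 3.782 into five terms given above.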
3.2 Multiple correspondence analysis (MCA)
Figure 3 shows the CA of the crisp data, in other words the MCA of the indicator matrix Z (n×Q),
where Q = 5 is the number of variables. This analysis is based on the SVD of the centred and scaled
matrix (see, for example, Greenacre, 2006, p. 52):
$$ \frac{1}{\sqrt{n}}\left(\frac{1}{Q}\,Z - \mathbf{1}\mathbf{1}^T D\right) D^{-1/2} = U D_\alpha V^T \tag{4} $$
where $\mathbf{1}$ denotes a vector of ones of appropriate length and D is the diagonal matrix of column masses cj, i.e., the marginal frequencies of each category
relative to their grand total of nQ = 200. The asymmetric map in Figure 3 is constructed from
coordinates in the first two columns of $\sqrt{n}\,U D_\alpha$ and $D^{-1/2}V$, again denoted by F (40×2) and Γ (15×2).
The two principal axes explain 26.2% and 19.7% of the inertia, totalling 45.9% for the two-
dimensional map. This is much less than the PCA, and once more we have to ask what this measure
of quality really means. Again, this is a measure of how well the data are reconstructed from the map,
but the data themselves are just zeros and ones whereas the reconstructed data from the MCA map are
real numbers. Data are again reconstructed using scalar products between these two sets of points as
follows (see, for example, Greenacre, 2007, page 101):
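$$ \hat{z}_{ij} = Q\, c_j \left(1 + f_i^T \gamma_j\right), \quad \text{or in matrix terms:} \quad \hat{Z} = Q\left(\mathbf{1}\mathbf{1}^T + F\Gamma^T\right) D \tag{5} $$
where $f_i$ and $\gamma_j$ denote the i-th row of F and the j-th row of Γ respectively.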
For example, the reconstructed data for the variable “Sun” in the city of Adana can be calculated from
the two-dimensional coordinates of Sun1, Sun2 and Sun3 and Adana to be 0.0622, 0.5630 and 0.3748
(notice that these approximated values sum to 1 – they are, in fact, fuzzy). The explained variance of
45.9% again measures how close the approximations [ 0.0622 0.5630 0.3748 ] are to the “observed”
indicator values [ 0 1 0 ], evaluated globally over all entries in Table 3. It is now understandable why
the measure of fit is doomed to be low, because the reconstructed data will always be fuzzy while the
original data is discrete.
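The computations of this section can be sketched in a few lines of Python. The sketch is illustrative and rests on stated assumptions: the indicator matrix is synthetic (it is not the crisp coding of the Turkish data), the function name `ca_of_coded_matrix` is hypothetical, and the reconstruction uses the standard CA reconstitution formula in asymmetric form, ẑij = Q cj (1 + fiᵀγj), with rows in principal and columns in standard coordinates.

```python
import numpy as np

def ca_of_coded_matrix(Z, Q, ndim=2):
    """CA of an n x J coded matrix in which each of the Q variables sums to 1 per row.

    Returns principal inertias, row principal coordinates F, column standard
    coordinates G, and the data reconstructed from the first `ndim` dimensions.
    (Hypothetical helper written for illustration.)
    """
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[0]
    c = Z.sum(axis=0) / (n * Q)                      # column masses
    S = (Z / Q - c) / np.sqrt(c) / np.sqrt(n)        # centred and scaled matrix
    U, alpha, Vt = np.linalg.svd(S, full_matrices=False)
    F = np.sqrt(n) * U[:, :ndim] * alpha[:ndim]      # row principal coordinates
    G = Vt[:ndim].T / np.sqrt(c)[:, None]            # column standard coordinates
    Z_hat = Q * c * (1.0 + F @ G.T)                  # reconstructed (fuzzy) values
    return alpha ** 2, F, G, Z_hat

# Synthetic crisp data: 40 cases, Q = 5 variables, 3 categories each
rng = np.random.default_rng(1)
n, Q, k = 40, 5, 3
cats = np.column_stack([rng.permutation(np.arange(n) % k) for _ in range(Q)])
Z = np.concatenate([np.eye(k)[cats[:, q]] for q in range(Q)], axis=1)

inertias, F, G, Z_hat = ca_of_coded_matrix(Z, Q)
print("total inertia:", round(inertias.sum(), 6))            # (J - Q)/Q for an indicator matrix
print("% explained in 2D:", round(100 * inertias[:2].sum() / inertias.sum(), 1))
print("reconstructed values, case 1, variable 1:", np.round(Z_hat[0, :k], 4))
```

The reconstructed values for each variable sum to 1 but are real numbers rather than zeros and ones, which is why the usual percentage of inertia cannot be high for crisp data.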
MCA is often defined alternatively as the CA of the Burt matrix, the matrix composed of all two-way
cross-tabulations of the variables: $B = Z^T Z$. In our example this will be a 15×15 matrix formed of 25
3×3 tables cross-tabulating all pairs of the five categorical variables. The standard coordinates of the
variables, i.e. the coordinates of the categories in Figure 3, are identical in this analysis, but the
eigenvalues are the squares of those of the indicator matrix, leading to an apparent improvement in the
percentages of inertia explained. In our example the percentages of inertia in two dimensions are now
43.7% and 24.9%, giving a total of 68.6%. This measure of fit can still be criticized as being
low, however, due to an artefact of the Burt matrix B (Greenacre, 1991). The tables down the
diagonal of B, which cross-tabulate each variable with itself, have high inertias and these
artificially inflate the total inertia of B – in fact, in this example these diagonal tables account for
63.8% of the inertia of B. Percentages of explained inertia are thus low because the denominator in
the calculation is artificially high, and a way to resolve this is to avoid fitting these diagonal blocks of
B (see Sections 4.5 and 4.6 later).
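The relationship between the two forms of MCA, and the weight of the diagonal blocks in the total inertia of B, can be verified numerically. The following Python sketch uses a synthetic indicator matrix (an assumption made for illustration; it is not the crisp coding of the Turkish data), so the share of inertia it reports for the diagonal blocks will differ from the 63.8% quoted above.

```python
import numpy as np

# Synthetic indicator matrix: 40 cases, Q = 5 variables, 3 categories each
rng = np.random.default_rng(1)
n, Q, k = 40, 5, 3
cats = np.column_stack([rng.permutation(np.arange(n) % k) for _ in range(Q)])
Z = np.concatenate([np.eye(k)[cats[:, q]] for q in range(Q)], axis=1)

c = Z.sum(axis=0) / (n * Q)                          # column masses (same for Z and B)

# Principal inertias of the indicator matrix
S_Z = (Z / Q - c) / np.sqrt(c) / np.sqrt(n)
inertias_Z = np.linalg.svd(S_Z, compute_uv=False) ** 2

# CA of the Burt matrix B = Z'Z
B = Z.T @ Z
P = B / B.sum()
S_B = (P - np.outer(c, c)) / np.sqrt(np.outer(c, c))
inertias_B = np.linalg.svd(S_B, compute_uv=False) ** 2

# Share of the total inertia of B located in the diagonal blocks
diag_blocks = np.kron(np.eye(Q), np.ones((k, k))).astype(bool)
diag_share = (S_B[diag_blocks] ** 2).sum() / (S_B ** 2).sum()

print("inertias of B equal squares of those of Z:", np.allclose(inertias_B, inertias_Z ** 2))
print("share of inertia in diagonal blocks:", round(100 * diag_share, 1), "%")
```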
3.3 Fuzzy MCA
Figure 4 shows the CA of the fuzzy coded data of Table 2, or fuzzy MCA. The theory here is identical
to that of MCA given in Section 3.2, simply replacing Z in (4) by the fuzzy coded matrix Z*, and
recomputing the column margins whose relative values define the diagonal matrix D. Several
properties of crisp MCA carry over to fuzzy MCA:
(a) Each set of categories has weighted average at the origin of the map, because the sum of each
set of fuzzy coded values is also 1.
(b) As in any CA, the row and column sums of the approximated data, reconstructed from the
low-dimensional solution (using (5) in the present context), are the same as the original data.
In addition, in the reconstructed fuzzy data, each set of membership levels sums to 1; as in the
case of crisp MCA, these sets of values are not necessarily all positive.
(c) A fuzzy Burt matrix can also be computed as $B^* = Z^{*T} Z^*$: the CA of B* also has identical
standard (row or column) category coordinates to those (columns) of Z*, and the principal
inertias of B* are equal to the squares of those of Z*.
Since the fuzzy coded matrix Z* has fewer zeros than the indicator matrix, its total inertia will be lower
– in the example it is 1.025, compared to the value of exactly 2 for Z‡. Since both the input data and
the approximated values from the fuzzy MCA map are fuzzy, it is to be expected that fuzzy MCA will
do better in the reconstruction of data, in terms of the usual measure of fit. In the present example the
percentage of inertia explained in the two-dimensional map is 55.0%, about 9 percentage points higher
than the MCA of the crisp indicator matrix. Notice how much better separated the cities are from one
another in Figure 4 compared to Figure 3 – the cities that coincided in Figure 3 are now separated.
But our thesis is that the value of 55.0% variance explained is still artificially low, and can be
improved by reconsidering the way this quantity is measured in fuzzy MCA.
‡ The inertia of an indicator matrix constructed from Q variables that give rise to J categories is equal to
(J–Q )/Q (see Benzécri (1973) or Greenacre (1984) for a proof).
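The comparison of total inertias for crisp and fuzzy coding can be illustrated with a short Python sketch. It is based on synthetic continuous data (not Table 1), uses hypothetical helper names (`total_inertia`, `fuzzy_code`), and assumes the three-point triangular coding with hinges at the minimum, median and maximum described in Section 2; the numerical value obtained for the fuzzy inertia therefore differs from the 1.025 reported for the example.

```python
import numpy as np

def total_inertia(Zc, Q):
    """Total inertia of the CA of a coded matrix with Q variables (hypothetical helper)."""
    Zc = np.asarray(Zc, dtype=float)
    n = Zc.shape[0]
    c = Zc.sum(axis=0) / (n * Q)                     # column masses
    S = (Zc / Q - c) / np.sqrt(c) / np.sqrt(n)       # centred and scaled matrix
    return (S ** 2).sum()

def fuzzy_code(x):
    """Triangular fuzzy coding of one variable, hinges at its min, median and max."""
    m1, m2, m3 = x.min(), np.median(x), x.max()
    low = np.clip((m2 - x) / (m2 - m1), 0.0, 1.0)
    high = np.clip((x - m2) / (m3 - m2), 0.0, 1.0)
    return np.column_stack([low, 1.0 - low - high, high])

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))                         # synthetic continuous data
Q = X.shape[1]

Z_fuzzy = np.hstack([fuzzy_code(X[:, q]) for q in range(Q)])
# Crisp coding: 1 for the category with highest membership, 0 elsewhere
Z_crisp = np.hstack([np.eye(3)[fuzzy_code(X[:, q]).argmax(axis=1)] for q in range(Q)])

print("crisp total inertia:", round(total_inertia(Z_crisp, Q), 4))   # (J - Q)/Q = 2
print("fuzzy total inertia:", round(total_inertia(Z_fuzzy, Q), 4))
```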
4. Measures of fit in crisp and fuzzy MCA
In this section we consider three alternative measures of fit introduced in Section 1, and apply each of
them to crisp and fuzzy MCA respectively.
4.1 Predicting categories: crisp MCA
In regular (crisp) MCA, the coding indicates the categories of each variable to which each individual
(a city in our application) belongs. An obvious measure of the quality of the solution is to count how
many of these categories are correctly predicted. For example, from the approximated values
[ 0.0622 0.5630 0.3748 ] given in Section 3.2 for the city of Adana, we would predict that Adana is
in category Sun2, which is a correct prediction and in this sense without error. Performing this
calculation for the whole table we obtain 171 correct predictions out of the total of 200, giving a
prediction accuracy of 85.5%, much higher than the usual measure of fit of 45.9% from the MCA of
the indicator matrix. The process of converting the fuzzy values in the approximated vector to discrete
ones, for example converting [ 0.0622 0.5630 0.3748 ] to [ 0 1 0 ], is called “defuzzification”, in this
case defuzzifying to discrete indicator data. Hence, once the approximated values from the MCA map
are defuzzified, the quality of the map is 85.5% as far as predicting the categories of the continuous
variables is concerned. The quality is much lower if we defuzzify the results of the MCA back to
estimations of the original data, as we shall describe in Section 4.3.
4.2 Predicting categories: fuzzy MCA
Taking the example of sunshine in Adana again, the fuzzy coded data are (from Table 2)
[ 0 0.634 0.366 ], while the approximated data are computed from the coordinates in Figure 4, using
the same reconstruction formula (5) as in crisp MCA, to be [ 0.112 0.393 0.496 ]. The percentage
55.0% computed in Section 3.3 is an evaluation of how close the approximations are to their original
fuzzy values, computed over the whole data set. If we now had to predict the category with highest
membership value (sunshine category Sun2), then we would be wrong: the highest reconstructed value
is 0.496 for category Sun3. We counted the correct category predictions using the fuzzy MCA map in
this way, and obtained 168 correct predictions out of 200, that is an accuracy of 84.0%, slightly less
than in the crisp case. In other words, defuzzifying (to discrete indicator values) both the fuzzy data
and their approximations, we obtain a prediction accuracy of 84.0%. Although the fuzzy MCA is
approximating the fuzzy data better (55.0% inertia explained, compared to 45.9% for crisp MCA), it is
giving slightly less accuracy in predicting the categories with highest degree of membership – this is
intuitively understandable because the indicator matrix codes these categories crisply, so should do
better in predicting them.
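The counting of correct category predictions in Sections 4.1 and 4.2 can be sketched as follows. This is an illustrative Python fragment with a hypothetical function name (`category_accuracy`); it assumes that every variable has the same number of categories, stored in adjacent columns, and it is applied here only to the Adana sunshine values quoted in the text rather than to the full tables.

```python
import numpy as np

def category_accuracy(coded, approx, Q):
    """Proportion of (case, variable) pairs whose highest-membership category
    in `coded` is also the highest in `approx`.

    Both matrices are n x (Q * k), with the k categories of each variable in
    adjacent columns. (Hypothetical helper written for illustration.)
    """
    coded = np.asarray(coded, dtype=float)
    approx = np.asarray(approx, dtype=float)
    n = coded.shape[0]
    observed = coded.reshape(n, Q, -1).argmax(axis=2)    # defuzzified data
    predicted = approx.reshape(n, Q, -1).argmax(axis=2)  # defuzzified approximations
    return (observed == predicted).mean()

# Crisp MCA, sunshine in Adana: Sun2 observed and Sun2 predicted (correct)
crisp_acc = category_accuracy([[0, 1, 0]], [[0.0622, 0.5630, 0.3748]], Q=1)

# Fuzzy MCA, sunshine in Adana: Sun2 has highest membership, Sun3 is predicted (wrong)
fuzzy_acc = category_accuracy([[0, 0.634, 0.366]], [[0.112, 0.393, 0.496]], Q=1)

print(crisp_acc, fuzzy_acc)
```

Applied to the full 40×15 matrices with Q = 5, the same function would return the proportion of the 200 predictions that are correct.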
4.3 Estimating original data: crisp MCA
Since the approximations we obtain from crisp and fuzzy MCA solutions are fuzzy, adding up to 1 in
each case, we can defuzzify them to estimate the original variables. For example, consider again the
approximation of sunshine in Adana from the crisp MCA solution: [ 0.062 0.563 0.375 ]. Using
these values to weight the minimum (4.14), median (7.10) and maximum (8.33) hours of sunshine, we
obtain the estimate: 0.062×4.14 + 0.563×7.10 + 0.375×8.33 = 7.38, compared to the original value of
7.55. The rationale for computing the estimate in this way is that this weighted average calculation is
exactly the way we invert the fuzzification of the data according to the triangular membership
functions (see the next section).
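The weighted-average calculation above can be written as a one-line function. The sketch below is illustrative, with a hypothetical function name (`defuzzify`); the hinge values and the approximated memberships are those quoted in the text for sunshine in Adana.

```python
import numpy as np

def defuzzify(memberships, hinges):
    """Estimate the continuous value as the membership-weighted average of the
    hinge points (minimum, median, maximum). Hypothetical helper."""
    return np.asarray(memberships, dtype=float) @ np.asarray(hinges, dtype=float)

hinges_sun = [4.14, 7.10, 8.33]                 # min, median, max hours of sunshine
estimate = defuzzify([0.062, 0.563, 0.375], hinges_sun)
print(round(float(estimate), 2))                # 7.38, compared to the original 7.55
```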
We need to standardize the data and their estimations to be able to measure the fit, as we did in PCA.
The problem with this defuzzification is that the estimated values do not have the same variable means
as the original data, neither are the errors orthogonal to the estimated values, so we do not obtain a
decomposition of the form (3). The best we can do in this situation is to calculate a stress measure in
the form of the sum-of-squared errors, which is equal to 2.25 in this case, and then express it relative
to the total variance of 5, i.e., 44.5% stress. This error percentage is still better than the unexplained
inertia of 54.1% in the case of crisp MCA (its explained inertia was 45.9%), even though one would
not expect the crisp MCA to recover the original data well when only coarse zero-one indicator data
have been fitted. This is another demonstration that the explained variance in MCA is pessimistic,
even by this very conservative measure of error percentage. We would, however, expect the fuzzy
MCA to perform better on this criterion, as explained next, because the input fuzzy data code the
continuous data precisely.
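A stress measure of this kind might be computed as in the following Python sketch. It is a sketch under stated assumptions rather than the authors' procedure: the function name `stress_percentage` is hypothetical, the example data are synthetic, and the estimates are standardized with the means and standard deviations of the original variables, a choice that the text does not spell out.

```python
import numpy as np

def stress_percentage(X, X_hat):
    """Sum of squared errors between standardized data and standardized
    estimates, as a percentage of the total variance.

    ASSUMPTION: both X and X_hat are standardized using the column means and
    standard deviations (divisor n) of the ORIGINAL data X.
    """
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    n = X.shape[0]
    mean, sd = X.mean(axis=0), X.std(axis=0)
    Y = (X - mean) / sd
    Y_hat = (X_hat - mean) / sd
    sse = ((Y - Y_hat) ** 2).sum() / n      # sum of squared errors, per case
    total = (Y ** 2).sum() / n              # equals the number of variables
    return 100.0 * sse / total

# Synthetic illustration: estimates are the data plus noise
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
X_hat = X + 0.5 * rng.normal(size=X.shape)
print(round(stress_percentage(X, X_hat), 1), "% stress")
```

Because the errors are not orthogonal to the estimates, this percentage is not the complement of an explained variance; it can even exceed 100% for very poor estimates.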
4.4 Estimating original data: fuzzy MCA
In our application the fuzzy coding of the data is a 1-to-1 mapping, thanks to the use of triangular
membership functions, and the process can be reversed, that is, defuzzified back to the original data.
Using the same example as in the previous paragraphs, Adana’s original sunshine value of 7.55 can be
recovered from the fuzzy values as the centroid of the minimum, median and maximum values,
using its fuzzy values as weights: 7.55 = 0×4.14 + 0.634×7.10 + 0.366×8.33 (this is a direct result of
(1)). In the same way, we can obtain a corresponding estimate from the fuzzy approximation
[ 0.112 0.393 0.496 ]: 0.112×4.14 + 0.393×7.10 + 0.496×8.33, which is equal to 7.39. After
performing this defuzzification for all 200 data values, several interesting results emerge:
1. The defuzzified estimates do have the same (column) means as the original data, and these
means are equal to weighted averages, like those above, using the means of the columns of Z*
as weights (the column means of Z* are the same as those of Z). For example, the means of
the first three columns of Z* corresponding to temperature are 0.191, 0.613 and 0.196: