Electronic copy available at: http://ssrn.com/abstract=1107815

Measures of fit in multiple correspondence analysis of crisp and

fuzzy coded data

Zerrin Aşan1 and Michael Greenacre2

1Department of Statistics, Anadolu University, Eskişehir, Turkey

Email: zasan@anadolu.edu.tr

2Department of Economics and Business, Universitat Pompeu Fabra, Barcelona, Spain

Email: michael@upf.es

Abstract

When continuous data are coded to categorical variables, two types of coding are possible: crisp

coding in the form of indicator, or dummy, variables with values either 0 or 1; or fuzzy coding where

each observation is transformed to a set of “degrees of membership” between 0 and 1, using so-called

membership functions.

It is well known that the correspondence analysis of crisp coded data, namely multiple correspondence

analysis, yields principal inertias (eigenvalues) that considerably underestimate the quality of the

solution in a low-dimensional space. Since the crisp data only code the categories to which each

individual case belongs, an alternative measure of fit is simply to count how well these categories are

predicted by the solution. Another approach is to consider multiple correspondence analysis

equivalently as the analysis of the Burt matrix (i.e., the matrix of all two-way cross-tabulations of the

categorical variables), and then perform a joint correspondence analysis to fit just the off-diagonal

tables of the Burt matrix – the measure of fit is then computed as the quality of explaining these tables

only.

The correspondence analysis of fuzzy coded data, called “fuzzy multiple correspondence analysis”,

suffers from the same problem, albeit attenuated. Again, one can count how many correct predictions

are made of the categories which have highest degree of membership. But here one can also defuzzify

the results of the analysis to obtain estimated values of the original data, and then calculate a measure

of fit in the familiar percentage form, thanks to the resultant orthogonal decomposition of variance.

Furthermore, if one thinks of fuzzy multiple correspondence analysis as explaining the two-way

associations between variables, a fuzzy Burt matrix can be computed and the same strategy as in the

crisp case can be applied to analyse the off-diagonal part of this matrix.

In this paper these alternative measures of fit are defined and applied to a data set of continuous

meteorological variables, which are coded crisply and fuzzily into three categories. Measuring the fit

is further discussed when the data set consists of a mixture of discrete and continuous variables.

Key words: data coding, defuzzification, fuzzy coding, indicator matrix, joint correspondence

analysis, measure of fit, multiple correspondence analysis, Burt matrix.

Acknowledgments

The first author thanks Sevil Sentürk for assistance and useful comments. The second author thanks

the Fundación BBVA for financial support in this research, as well as the Spanish Ministry of

Education and Science for partial support in the form of grant MEC-SEJ2006-14098.

1. Introduction

Multiple correspondence analysis (MCA) is the correspondence analysis (CA) of a data set of

categorical variables that are coded as zero-one (dummy) variables in an indicator matrix (also called

“logical coding”) or as a matrix composed of all two-way contingency tables called a “Burt matrix”

(see, for example, Benzécri (1973) or, for a recent account, Greenacre and Blasius (2006)). In CA the

eigenvalues, or principal inertias, expressed as percentages relative to their total, are used as a measure

of fit, but in MCA it is well known that these percentages give pessimistic estimates of the quality of

the solution. For example, Lebart (1984, 2006) states that in MCA the “percentages of variance are

misleading measures of information”. The main issue here is to consider the type of data entering the

CA algorithm and whether it makes sense to measure the fit in terms of the usual reconstruction of

these data by the low-dimensional solution. In the case of the input indicator matrix, it seems obvious

that we are not interested in an exact reconstruction of the zeros and ones in the data, so to measure the

fit in this way makes little sense. An alternative that seems more appropriate to the discrete zero-one

data would be to measure how well the MCA solution predicts the categories to which each case

belongs. In the case of the input Burt matrix, the reason for the low percentages is clear when we

consider what is included in “total inertia”, which constitutes the denominator in calculating the

measure of fit. Down the diagonal of the Burt matrix are cross-tabulations of each variable with itself,

inflating the total inertia by amounts that are firstly, without any interest, and secondly, impossible to

explain in a low-dimensional solution. Greenacre (1988, 1991, 2007: chap. 19) presented an

alternative way of performing CA in this situation, leading to much higher percentages of inertia,

because only the “interesting” inertia in the off-diagonal tables of the Burt matrix is being explained.

This approach, called joint correspondence analysis (JCA), contains simple CA (i.e., with two

variables) as an exact special case, and so provides a more natural generalization of the bivariate

problem to the multivariate one.

The present work extends these ideas to the CA of a fuzzy-coded matrix, which is a generalization of

the “crisp” zero-one indicator matrix that allows the coded values to be “fuzzy”, i.e., to be real

numbers between 0 and 1 (inclusive) for each category, while maintaining the property that the coded

values for each variable sum to 1. To our knowledge, fuzzy coding was introduced into the French

literature on CA in 1977 by Guitonneau and Roux (1977), while van Rijckevorsel (1988) attributes the

idea originally to the doctoral thesis of Bordet (1973). The idea has since entered different fields

of application of multivariate methods; see, for example, Chevenet (1994) for an application of fuzzy

coding in an ecological context.

We use the term “fuzzy MCA” for the CA of the fuzzy coded matrix, and we will show that the

problem of pessimistic percentages of explained inertia also occurs in fuzzy MCA, albeit attenuated.

Various alternative measures of fit will be discussed – these all measure how the CA solution recovers

the data in different ways. The standard way is to measure fit to the input data, whatever these data

may be. The alternatives look at the nature of the input data and define the measure of fit in terms of

recovering what is considered to be the “interesting” part of the data. We will be concerned mainly

with the following three alternatives:

1. Suppose we are interested in simply predicting the categories to which each case belongs or for

which it has highest degree of membership, rather than the exact values 1 and 0 (crisp coding) or

the degrees of membership themselves (fuzzy coding). Then the measure of fit can be calculated

as the percentage of categories correctly predicted (Sections 4.1 and 4.2).

2. Suppose we are interested in predicting the original continuous variables. We can estimate the

continuous variables from the MCA solution, which is inherently fuzzy in both crisp and fuzzy

analyses. Then a measure of fit is calculated between these estimates and the original data

(Sections 4.3 and 4.4).

3. Suppose we are interested in the two-way relationships summarized in the Burt matrix, based on

crisp or fuzzy coded data. Then we can use JCA, which already exists for crisp MCA, to explain the

off-diagonal tables which cross-tabulate distinct pairs of variables, ignoring the cross-tabulations

of each variable with itself (Sections 4.5 and 4.6).

The proposed approaches will be illustrated using meteorological data from 40 cities of Turkey. We

first describe the coding of these data in Section 2 and various “standard” results of analysing these

data in Section 3, using principal component analysis (PCA), MCA and fuzzy MCA. Then we will

describe the properties of each of the above measures of fit in Section 4, with a discussion in Section 5

where we also give some guidelines as to the circumstances in which one will be preferable to

the others, and some proposals for the analysis of mixed continuous–categorical data.

2. Continuous data and their recoding for CA

Table 1 contains average values of five meteorological variables for 40 cities of Turkey, based on

measurements taken in 2004 (Turkey Statistical Yearbook, 2004) – note that we use this example as a

convenient illustration of our approach rather than a substantive meteorological application.

There are several standard ways in the CA literature to recode continuous data in a form suitable for

CA, for example re-expressing each variable relative to its range, or replacing data values by their

ranks (see Greenacre, 2007: chapter 23); here we consider the discretization of the continuous scales

into a set of categorical ones. In other words, suppose we define three categories for each variable,

which we could call “low” (1), “middle” (2) and “high” (3), then we can replace each value by its

category number and use MCA to analyse the data. For each variable two boundary points need to be

chosen in order to divide the scale into three intervals. These boundary points can be chosen in

different ways, and thanks to the principle of distributional equivalence in CA (see, for example,

Greenacre, 2007: pages 37–38), the particular choice is not so critical. Clearly this recoding of the

data loses a lot of information, so an alternative fuzzy coding can be performed which conserves more

of the information in the data while still maintaining the low-middle-high idea. Again, there are

several ways of coding the data into fuzzy categories, called fuzzification (as opposed to

discretization), described in Loslever and Bouilland (1999) and Murtagh (2005), for example. We

have chosen the system of so-called “three-point triangular membership functions” shown in Figure 1

(also called piecewise linear functions, or second order B-splines – see van Rijckevorsel (1988) for a

theoretical account of this topic). The three values of the minimum, median and maximum of the

variable are used as so-called “hinge” points to define triangles which convert the continuous scale

into values on a 0 to 1 scale such that the membership values sum to 1. The first and last triangles are

“shouldered”, with maxima of 1 at the continuous variable’s minimum and maximum values

respectively, while the middle triangle has a maximum of 1 at the median. The definition of the set of

triangular membership functions is thus as follows, for the input continuous variable x which has

minimum value m1, median m2 and maximum value m3:

\[
z_1(x) = \begin{cases} \dfrac{m_2 - x}{m_2 - m_1}, & \text{for } x \le m_2 \\ 0, & \text{otherwise} \end{cases}
\qquad
z_2(x) = \begin{cases} \dfrac{x - m_1}{m_2 - m_1}, & \text{for } x \le m_2 \\ \dfrac{m_3 - x}{m_3 - m_2}, & \text{for } x > m_2 \end{cases}
\qquad
z_3(x) = \begin{cases} \dfrac{x - m_2}{m_3 - m_2}, & \text{for } x > m_2 \\ 0, & \text{otherwise} \end{cases}
\tag{1}
\]

These functions give the three degrees of membership that code the variable x into the three fuzzy

categories. Notice that a particular input value will belong to no more than two fuzzy categories, and

that the membership degrees sum to 1. Alternatives to triangular membership functions can be


trapezoidal, Gaussian and generalized bell (or Cauchy) membership functions, which have various

theoretical advantages (see, for example, Jang, Sun and Mizutani, 1997; Senturk, 2006), and we could

also consider more than three fuzzy categories. These specific aspects about fuzzy coding have been

dealt with extensively in the literature – see, for example, Zhou, Purvis and Kazabov (1997) and

Verkuilen (2005) for a discussion of choice of membership functions (the latter reference gives a more

statistical treatment of the subject and shows connections to psychometric scaling). Here we have

chosen one of the simplest forms of coding for illustrative purposes, since our purpose is to deal with

properties of the CA of such data.

The fuzzy coded data are given in Table 2. To compare this with crisp coded indicator data, we

constructed a table consisting of zero-one dummies where two out of the three categories are zeros and

the value of one is assigned to the category with highest membership value according to Table 2 – that

is, the boldface values in Table 2 are replaced by 1s and the rest 0s.
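The coding step can be sketched in a few lines of NumPy. This is only an illustration of the triangular membership functions in (1) and of the crisp recoding described above; the function names `fuzzify_triangular` and `crisp_from_fuzzy` are hypothetical, and the only data used are the sunshine hinge points (4.14, 7.10, 8.33) and the Adana value 7.55 quoted later in the text.

```python
import numpy as np

def fuzzify_triangular(x, m1, m2, m3):
    """Code x into three fuzzy categories using hinge points m1 < m2 < m3."""
    x = np.asarray(x, dtype=float)
    low = np.where(x <= m2, (m2 - x) / (m2 - m1), 0.0)
    mid = np.where(x <= m2, (x - m1) / (m2 - m1), (m3 - x) / (m3 - m2))
    high = np.where(x > m2, (x - m2) / (m3 - m2), 0.0)
    return np.stack([low, mid, high], axis=-1)

def crisp_from_fuzzy(z):
    """Indicator coding: 1 for the category with highest membership, 0 elsewhere.
    Ties are resolved in favour of the first category (an arbitrary choice)."""
    z = np.asarray(z, dtype=float)
    crisp = np.zeros_like(z)
    idx = np.expand_dims(z.argmax(axis=-1), -1)
    np.put_along_axis(crisp, idx, 1.0, axis=-1)
    return crisp

# Sunshine in Adana: hinges are the minimum, median and maximum hours of sunshine.
z_adana = fuzzify_triangular(7.55, 4.14, 7.10, 8.33)
print(np.round(z_adana, 3))        # memberships for low, middle, high
print(crisp_from_fuzzy(z_adana))   # corresponding crisp coding
```

Applied to a whole column, the same two functions would produce three columns of the fuzzy coded matrix and the matching three columns of the crisp indicator matrix.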

3. PCA, MCA and fuzzy MCA

We analyse the three forms of the data (the continuous data of Table 1, the fuzzy coded data of Table 2 and the crisp coded data) to make an initial comparison

of their results, focusing on the value and interpretation of the standard measure of fit in each case.

3.1 Principal component analysis (PCA)

Figure 2 shows the principal component analysis (PCA) of Table 1, where the data have been

standardized, and where the map has been scaled to be a biplot (see, for example, Gower and Hand,

1996). The map explains 75.6% of the variance, but what exactly does this measure of fit mean?

Taking the view of PCA as a method of matrix approximation, we can think of Figure 2 as a way of

approximating the standardized data. Suppose that the original data are denoted by the 40×5 matrix X

= [xij], and the standardized data by Y = [yij], with elements

$y_{ij} = (x_{ij} - \bar{x}_j)/s_j$, where $\bar{x}_j$ and $s_j$ are

the mean and standard deviation respectively of the j-th column variable. Then the PCA of Y is based

on the singular-value decomposition (SVD):

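\[
\frac{1}{\sqrt{n}}\,\mathbf{Y} = \mathbf{U}\mathbf{D}_\alpha\mathbf{V}^{\mathsf{T}}, \quad \text{where } \mathbf{U}^{\mathsf{T}}\mathbf{U} = \mathbf{V}^{\mathsf{T}}\mathbf{V} = \mathbf{I} \tag{2}
\]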

where n = 40. The biplot of Figure 2 is constructed using the first two columns of $\mathbf{U}\mathbf{D}_\alpha$ and $\sqrt{n}\,\mathbf{V}$,

denoted by $\mathbf{F}$ (40×2) and $\boldsymbol{\Gamma}$ (5×2) respectively, so that approximated standardized values $\hat{y}_{ij}$ can be

computed by calculating all 200 scalar products between the two sets of points:

\[
\hat{y}_{ij} = f_{i1}\gamma_{j1} + f_{i2}\gamma_{j2}, \quad i = 1,\dots,40;\ j = 1,\dots,5, \quad \text{i.e.} \quad \hat{\mathbf{Y}} = \mathbf{F}\boldsymbol{\Gamma}^{\mathsf{T}}.
\]

This is usually thought of geometrically as a projection of the row

(city) points onto the vectors defined by the variables, with appropriate calibration of the axes which

depends on the length of these vectors (Gabriel and Odoroff, 1990). The total variance of the matrix Y,

with value 5 in this case because there are five variables each with variance 1, can be decomposed into

two parts, thanks to the orthogonality between the solution estimates

$\hat{y}_{ij}$ and the errors $y_{ij} - \hat{y}_{ij}$:

\[
\frac{1}{n}\sum_i\sum_j y_{ij}^2 = \frac{1}{n}\sum_i\sum_j \hat{y}_{ij}^2 + \frac{1}{n}\sum_i\sum_j (y_{ij} - \hat{y}_{ij})^2 \tag{3}
\]

i.e., in our example 5 = 3.782 + 1.218; hence the figure of 75.6% is the explained part (3.782) of the

decomposition relative to the total, whereas the unexplained, or error, part (1.218) is 24.4% of the

total. Literally, 75.6% tells us how close the map is to the standardized data in the least-squares sense.

Putting this another way by using an analogy with regression analysis, the two principal axes of the

map are orthogonal latent variables predicting the data in Y, the first axis explaining 43.3% of the

variance, and the second an additional 32.3%, that is a coefficient of determination $R^2$ = 0.756. The

explained variance of 3.782 can be further decomposed into parts for the five variables in order to see

which variables are explained better than others: 3.782 = 0.831 + 0.793 + 0.620 + 0.801 + 0.737 (for

perfect explanation of a variable, its explained part would be 1).
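The decomposition (3) can be checked numerically. The sketch below uses a synthetic 40×5 matrix (not the data of Table 1) and assumes standardization with divisor n, which is what makes the total variance equal to the number of variables. It follows (2) and the biplot coordinates described above, then verifies that the explained and error parts add up to the total.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 5, 2
# Synthetic data with correlated columns; these are NOT the meteorological data.
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

# Standardize with divisor n so that every column has variance exactly 1.
Y = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of Y / sqrt(n): squared singular values are variances along principal axes.
U, alpha, Vt = np.linalg.svd(Y / np.sqrt(n), full_matrices=False)
F = (U * alpha)[:, :k]             # first k columns of U D_alpha
G = np.sqrt(n) * Vt.T[:, :k]       # first k columns of sqrt(n) V
Y_hat = F @ G.T                    # rank-k approximation of Y

total = (Y ** 2).sum() / n
explained = (Y_hat ** 2).sum() / n
error = ((Y - Y_hat) ** 2).sum() / n
per_variable = (Y_hat ** 2).sum(axis=0) / n   # explained part for each variable

print(f"total={total:.3f} explained={explained:.3f} error={error:.3f}")
print(f"percentage explained={100 * explained / total:.1f}")
print(np.round(per_variable, 3))
```

The per-variable parts are bounded by 1 because each standardized column has variance 1, which is the sense in which a value of 1 means perfect explanation.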

3.2 Multiple correspondence analysis (MCA)

Figure 3 shows the CA of the crisp data, in other words the MCA of the indicator matrix Z (n×Q),

where Q = 5 is the number of variables. This analysis is based on the SVD of the centred and scaled

matrix (see, for example, Greenacre, 2006, p. 52):

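\[
\sqrt{n}\left(\frac{1}{nQ}\,\mathbf{Z} - \frac{1}{n}\,\mathbf{1}\mathbf{1}^{\mathsf{T}}\mathbf{D}\right)\mathbf{D}^{-1/2} = \mathbf{U}\mathbf{D}_\alpha\mathbf{V}^{\mathsf{T}}, \quad \text{where } \mathbf{U}^{\mathsf{T}}\mathbf{U} = \mathbf{V}^{\mathsf{T}}\mathbf{V} = \mathbf{I} \tag{4}
\]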

where D is the diagonal matrix of column masses cj, i.e., the marginal frequencies of each category

relative to their grand total of nQ = 200. The asymmetric map in Figure 3 is constructed from

coordinates in the first two columns of $\sqrt{n}\,\mathbf{U}\mathbf{D}_\alpha$ and $\mathbf{D}^{-1/2}\mathbf{V}$, again denoted by $\mathbf{F}$ (40×2) and $\boldsymbol{\Gamma}$ (15×2).

The two principal axes explain 26.2% and 19.7% of the inertia, totalling 45.9% for the two-

dimensional map. This is much less than the PCA, and once more we have to ask what this measure

of quality really means. Again, this is a measure of how well the data are reconstructed from the map,

but the data themselves are just zeros and ones whereas the reconstructed data from the MCA map are

real numbers. Data are again reconstructed using scalar products between these two sets of points as

follows (see, for example, Greenacre, 2007, page 101):

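\[
\hat{z}_{ij} = Q c_j \left(1 + f_{i1}\gamma_{j1} + f_{i2}\gamma_{j2}\right), \quad \text{or in matrix terms:} \quad \hat{\mathbf{Z}} = Q\left(\mathbf{1}\mathbf{1}^{\mathsf{T}} + \mathbf{F}\boldsymbol{\Gamma}^{\mathsf{T}}\right)\mathbf{D}. \tag{5}
\]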

For example, the reconstructed data for the variable “Sun” in the city of Adana can be calculated from

the two-dimensional coordinates of Sun1, Sun2 and Sun3 and Adana to be 0.0622, 0.5630 and 0.3748

(notice that these approximated values sum to 1 – they are, in fact, fuzzy). The explained variance of

45.9% again measures how close the approximations [ 0.0622 0.5630 0.3748 ] are to the “observed”

indicator values [ 0 1 0 ], evaluated globally over all entries in Table 3. It is now understandable why

the measure of fit is doomed to be low, because the reconstructed data will always be fuzzy while the

original data are discrete.
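A minimal sketch of this computation follows, assuming the standard CA scaling of the indicator matrix (row masses 1/n, column masses in D) and using a synthetic indicator matrix rather than the crisp coded data of the paper. The helper name `mca_indicator` is hypothetical. It shows that the reconstructed values for each variable sum to 1 while being real numbers, which is the reason the usual percentage is low.

```python
import numpy as np

def mca_indicator(Z, Q, k=2):
    """CA of a matrix Z whose rows each sum to Q (indicator or fuzzy coding)."""
    n = Z.shape[0]
    c = Z.sum(axis=0) / (n * Q)                        # column masses (diagonal of D)
    # Centred and scaled matrix: sqrt(n) (Z/(nQ) - (1/n) 1 1^T D) D^{-1/2}
    S = np.sqrt(n) * (Z / (n * Q) - np.outer(np.full(n, 1 / n), c)) / np.sqrt(c)
    U, alpha, Vt = np.linalg.svd(S, full_matrices=False)
    F = np.sqrt(n) * (U * alpha)[:, :k]                # principal row coordinates
    G = (Vt.T / np.sqrt(c)[:, None])[:, :k]            # standard column coordinates
    Z_hat = Q * (1 + F @ G.T) * c                      # reconstruction from k dimensions
    return alpha ** 2, F, G, Z_hat

# Synthetic crisp coding: 40 cases, 5 variables, 3 categories each (all non-empty).
rng = np.random.default_rng(1)
n, Q, n_cat = 40, 5, 3
codes = np.stack([rng.permutation(np.arange(n) % n_cat) for _ in range(Q)], axis=1)
Z = np.zeros((n, Q * n_cat))
Z[np.arange(n)[:, None], codes + n_cat * np.arange(Q)] = 1.0

inertias, F, G, Z_hat = mca_indicator(Z, Q)
print("total inertia:", round(float(inertias.sum()), 6))
print("percentages, first two axes:", np.round(100 * inertias[:2] / inertias.sum(), 1))
print("reconstructed values, case 1, variable 1:", np.round(Z_hat[0, :n_cat], 4))
```

The percentages printed here refer to the synthetic data only and are not comparable with the 45.9% reported for the Turkish cities.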

MCA is often defined alternatively as the CA of the Burt matrix, the matrix composed of all two-way

cross-tabulations of the variables: $\mathbf{B} = \mathbf{Z}^{\mathsf{T}}\mathbf{Z}$. In our example this will be a 15×15 matrix formed of 25

3×3 tables cross-tabulating all pairs of the five categorical variables. The standard coordinates of the

variables, i.e. the coordinates of the categories in Figure 3, are identical in this analysis, but the

eigenvalues are the squares of those of the indicator matrix, leading to an apparent improvement in the

percentages of inertia explained. In our example the percentages of inertia in two dimensions are now

43.7% and 24.9%, giving a total of 68.6%. This measure of fit can still be criticized as being

low, however, due to an artefact of the Burt matrix B (Greenacre, 1991). The tables down the

diagonal of B, which cross-tabulate each variable with itself, have high inertias and these

artificially inflate the total inertia of B – in fact, in this example these diagonal tables account for

63.8% of the inertia of B. Percentages of explained inertia are thus low because the denominator in

the calculation is artificially high, and a way to resolve this is to avoid fitting these diagonal blocks of

B (see Sections 4.5 and 4.6 later).
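Both properties of the Burt matrix mentioned here can be illustrated on synthetic data: the principal inertias of B are the squares of those of Z, and the diagonal blocks contribute a fixed and sizeable amount to the total inertia of B. The sketch below is an illustration under the standard CA definitions, with hypothetical variable names; the share it prints is for the synthetic data and will differ from the 63.8% reported above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, Q, n_cat = 40, 5, 3
codes = np.stack([rng.permutation(np.arange(n) % n_cat) for _ in range(Q)], axis=1)
Z = np.zeros((n, Q * n_cat))
Z[np.arange(n)[:, None], codes + n_cat * np.arange(Q)] = 1.0

B = Z.T @ Z                                    # Burt matrix: all two-way cross-tabulations
c = Z.sum(axis=0) / (n * Q)                    # column masses, shared by Z and B

# Principal inertias of the indicator matrix.
S_Z = np.sqrt(n) * (Z / (n * Q) - np.outer(np.full(n, 1 / n), c)) / np.sqrt(c)
inertias_Z = np.linalg.svd(S_Z, compute_uv=False) ** 2

# Principal inertias of the Burt matrix (row and column masses are both c).
S_B = (B / B.sum() - np.outer(c, c)) / np.sqrt(np.outer(c, c))
inertias_B = np.linalg.svd(S_B, compute_uv=False) ** 2

# Inertia contributed by each of the Q x Q sub-tables of B.
block = (S_B ** 2).reshape(Q, n_cat, Q, n_cat).sum(axis=(1, 3))
diag_share = np.trace(block) / block.sum()

print("percentages (Z):", np.round(100 * inertias_Z[:2] / inertias_Z.sum(), 1))
print("percentages (B):", np.round(100 * inertias_B[:2] / inertias_B.sum(), 1))
print("share of inertia of B in diagonal blocks:", round(float(diag_share), 3))
```

Because the diagonal blocks always contribute (J − Q)/Q² to the inertia of B, their share is large whenever the associations between distinct variables are modest.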

3.3 Fuzzy MCA

Figure 4 shows the CA of the fuzzy coded data of Table 2, or fuzzy MCA. The theory here is identical

to that of MCA given in Section 3.2, simply replacing Z in (4) by the fuzzy coded matrix Z*, and

recomputing the column margins whose relative values define the diagonal matrix D. Several

properties of crisp MCA carry over to fuzzy MCA:

(a) Each set of categories has weighted average at the origin of the map, because the sum of each

set of fuzzy coded values is also 1.

(b) As in any CA, the row and column sums of the approximated data, reconstructed from the

low-dimensional solution (using (5) in the present context), are the same as the original data.

In addition, in the reconstructed fuzzy data, each set of membership levels sums to 1; as in the

case of crisp MCA, these sets of values are not necessarily all positive.

(c) A fuzzy Burt matrix can also be computed as $\mathbf{B}^* = \mathbf{Z}^{*\mathsf{T}}\mathbf{Z}^*$: the CA of B* also has identical

standard (row or column) category coordinates to those (columns) of Z*, and the principal

inertias of B* are equal to the squares of those of Z*.

Since the fuzzy coded matrix Z* has fewer zeros than the indicator matrix, its total inertia will be lower

– in the example it is 1.025, compared to the value of exactly 2 for Z‡. Since both the input data and

the approximated values from the fuzzy MCA map are fuzzy, it is to be expected that fuzzy MCA will

do better in the reconstruction of data, in terms of the usual measure of fit. In the present example the

percentage of inertia explained in the two-dimensional map is 55.0%, about 9 percentage points higher

than the MCA of the crisp indicator matrix. Notice how much better separated the cities are from one

another in Figure 4 compared to Figure 3 – the cities that coincided in Figure 3 are now separated.

But our thesis is that the value of 55.0% variance explained is still artificially low, and can be

improved by reconsidering the way this quantity is measured in fuzzy MCA.
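Properties (a) to (c), and the lower total inertia of the fuzzy coded matrix, can be checked on synthetic data. The sketch below assumes triangular coding with hinges at the minimum, median and maximum of each variable, as in Section 2; the helper names `fuzzify` and `standardized_residuals` are hypothetical and the data are random, not those of Table 2.

```python
import numpy as np

def fuzzify(x):
    """Triangular coding of one variable with hinges at its minimum, median, maximum."""
    m1, m2, m3 = x.min(), np.median(x), x.max()
    low = np.where(x <= m2, (m2 - x) / (m2 - m1), 0.0)
    high = np.where(x > m2, (x - m2) / (m3 - m2), 0.0)
    return np.column_stack([low, 1.0 - low - high, high])

def standardized_residuals(P):
    """CA residuals D_r^{-1/2} (P - r c^T) D_c^{-1/2}, with the masses r and c."""
    r, c = P.sum(axis=1), P.sum(axis=0)
    return (P - np.outer(r, c)) / np.sqrt(np.outer(r, c)), r, c

rng = np.random.default_rng(3)
n, Q, k = 40, 5, 2
X = rng.normal(size=(n, Q))                                  # synthetic continuous data
Z_star = np.hstack([fuzzify(X[:, j]) for j in range(Q)])     # 40 x 15 fuzzy coded matrix

S, r, c = standardized_residuals(Z_star / Z_star.sum())
U, alpha, Vt = np.linalg.svd(S, full_matrices=False)
F = (U * alpha)[:, :k] / np.sqrt(r)[:, None]                 # principal row coordinates
G = Vt.T[:, :k] / np.sqrt(c)[:, None]                        # standard column coordinates

total_inertia = (alpha ** 2).sum()
centroids = (c[:, None] * G).reshape(Q, 3, k).sum(axis=1)    # (a) weighted sums per variable
Z_hat = Q * (1 + F @ G.T) * c                                # (b) reconstruction by (5)
block_sums = Z_hat.reshape(n, Q, 3).sum(axis=2)

B_star = Z_star.T @ Z_star                                   # (c) fuzzy Burt matrix
S_burt, _, _ = standardized_residuals(B_star / B_star.sum())
inertias_burt = np.linalg.svd(S_burt, compute_uv=False) ** 2

print("total inertia of Z*:", round(float(total_inertia), 3), "(crisp value would be 2)")
print("smallest reconstructed membership:", round(float(Z_hat.min()), 3))
```

The smallest reconstructed membership is printed because, as noted in (b), the reconstructed values sum to 1 within each variable but need not all be positive.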

‡ The inertia of an indicator matrix constructed from Q variables that give rise to J categories is equal to

(J − Q)/Q (see Benzécri (1973) or Greenacre (1984) for a proof).

4. Measures of fit in crisp and fuzzy MCA

In this section we consider three alternative measures of fit introduced in Section 1, and apply each of

them to crisp and fuzzy MCA respectively.

4.1 Predicting categories: crisp MCA

In regular (crisp) MCA, the coding indicates the categories of each variable to which each individual

(a city in our application) belongs. An obvious measure of the quality of the solution is to count how

many of these categories are correctly predicted. For example, from the approximated values

[ 0.0622 0.5630 0.3748 ] given in Section 3.2 for the city of Adana, we would predict that Adana is

in category Sun2, which is a correct prediction and in this sense without error. Performing this

calculation for the whole table we obtain 171 correct predictions out of the total of 200, giving a

prediction accuracy of 85.5%, much higher than the usual measure of fit of 45.9% from the MCA of

the indicator matrix. The process of converting the fuzzy values in the approximated vector to discrete

ones, for example converting [ 0.0622 0.5630 0.3748 ] to [ 0 1 0 ], is called “defuzzification”, in this

case defuzzifying to discrete indicator data. Hence, once the approximated values from the MCA map

are defuzzified, the quality of the map is 85.5% as far as predicting the categories of the continuous

variables is concerned. The quality is much lower if we defuzzify the results of the MCA back to

estimations of the original data, as we shall describe in Section 4.3.

4.2 Predicting categories: fuzzy MCA

Taking the example of sunshine in Adana again, the fuzzy coded data are (from Table 2)

[ 0 0.634 0.366 ], while the approximated data are computed from the coordinates in Figure 4, using

the same reconstruction formula (5) as in crisp MCA, to be [ 0.112 0.393 0.496 ]. The percentage

55.0% computed in Section 3.3 is an evaluation of how close the approximations are to their original

fuzzy values, computed over the whole data set. If we now had to predict the category with highest

membership value (sunshine category Sun2), then we would be wrong: the highest reconstructed value

is 0.496 for category Sun3. We counted the correct category predictions using the fuzzy MCA map in

this way, and obtained 168 correct predictions out of 200, that is an accuracy of 84.0%, slightly less

than in the crisp case. In other words, defuzzifying (to discrete indicator values) both the fuzzy data

and their approximations, we obtain a prediction accuracy of 84.0%. Although the fuzzy MCA is

approximating the fuzzy data better (55.0% inertia explained, compared to 45.9% for crisp MCA), it is

giving slightly less accuracy in predicting the categories with highest degree of membership – this is

intuitively understandable because the indicator matrix codes these categories crisply, so should do

better in predicting them.
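The counting of correct predictions in Sections 4.1 and 4.2 amounts to comparing, for every case and variable, the category with the highest observed value against the category with the highest reconstructed value. A minimal sketch, with a hypothetical function name and using only the Adana sunshine values quoted in the text:

```python
import numpy as np

def category_accuracy(observed, reconstructed, n_cat=3):
    """Proportion of (case, variable) pairs whose highest-membership category
    is the same in the observed coding and in the reconstruction."""
    obs = np.asarray(observed, dtype=float)
    rec = np.asarray(reconstructed, dtype=float)
    obs = obs.reshape(obs.shape[0], -1, n_cat)
    rec = rec.reshape(rec.shape[0], -1, n_cat)
    return float(np.mean(obs.argmax(axis=2) == rec.argmax(axis=2)))

# Crisp MCA: indicator values and their reconstruction for Sun in Adana.
crisp_hit = category_accuracy([[0, 1, 0]], [[0.0622, 0.5630, 0.3748]])
# Fuzzy MCA: fuzzy values and their reconstruction for the same entry.
fuzzy_hit = category_accuracy([[0, 0.634, 0.366]], [[0.112, 0.393, 0.496]])

print(crisp_hit)   # Sun2 is predicted, which matches the crisp coding
print(fuzzy_hit)   # Sun3 is predicted, which does not match Sun2
print(171 / 200, 168 / 200)   # accuracies reported for the full tables
```

Passing the full 40×15 observed and reconstructed matrices to the same function would give the overall accuracies directly.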

4.3 Estimating original data: crisp MCA

Since the approximations we obtain from crisp and fuzzy MCA solutions are fuzzy, adding up to 1 in

each case, we can defuzzify them to estimate the original variables. For example, consider again the

approximation of sunshine in Adana from the crisp MCA solution: [ 0.062 0.563 0.375 ]. Using

these values to weight the minimum (4.14), median (7.10) and maximum (8.33) hours of sunshine, we

obtain the estimate: 0.062×4.14 + 0.563×7.10 + 0.375×8.33 = 7.38, compared to the original value of

7.55. The rationale for computing the estimate in this way is that this weighted average calculation is

exactly the way we invert the fuzzification of the data according to the triangular membership

functions (see the next section).
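The weighted average used here is a one-line computation. The sketch below, with a hypothetical function name `defuzzify`, reproduces the Adana estimate from the crisp MCA approximation and also shows why the weighted average is the natural inverse of the triangular coding: applied to the exact fuzzy values of 7.55 it returns 7.55.

```python
import numpy as np

def defuzzify(memberships, hinges):
    """Weighted average of the hinge points, using the memberships as weights."""
    return float(np.asarray(memberships, dtype=float) @ np.asarray(hinges, dtype=float))

hinges = [4.14, 7.10, 8.33]          # minimum, median, maximum hours of sunshine

# Estimate from the crisp MCA approximation for Adana.
estimate = defuzzify([0.062, 0.563, 0.375], hinges)
print(round(estimate, 2))

# Round trip: exact triangular memberships of the original value 7.55.
x = 7.55
exact = [0.0, (8.33 - x) / (8.33 - 7.10), (x - 7.10) / (8.33 - 7.10)]
print(round(defuzzify(exact, hinges), 2))
```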

We need to standardize the data and their estimations to be able to measure the fit, as we did in PCA.

The problem with this defuzzification is that the estimated values do not have the same variable means

as the original data, neither are the errors orthogonal to the estimated values, so we do not obtain a

decomposition of the form (3). The best we can do in this situation is to calculate a stress measure in

the form of the sum-of-squared errors, which is equal to 2.25 in this case, and then express it relative

to the total variance of 5, i.e., 44.5% stress. This error percentage is still better than the unexplained

inertia of 54.1% in the case of crisp MCA (its explained inertia was 45.9%), even though one would

not expect the crisp MCA to recover the original data well when only coarse zero-one indicator data

have been fitted. This is another demonstration that the explained variance in MCA is pessimistic,

even by this very conservative measure of error percentage. We would, however, expect the fuzzy

MCA to perform better on this criterion, as explained next, because the input fuzzy data code the

continuous data precisely.
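One possible reading of this stress measure can be sketched as follows. It assumes that both the data and their estimates are standardized with the means and standard deviations (divisor n) of the original data, so that the total variance equals the number of variables; the paper does not spell out these details, so this is an assumption, and the function name `stress` and the synthetic data are hypothetical.

```python
import numpy as np

def stress(X, X_hat):
    """Sum of squared standardized errors, relative to the total variance."""
    X, X_hat = np.asarray(X, dtype=float), np.asarray(X_hat, dtype=float)
    n, p = X.shape
    s = X.std(axis=0)                               # divisor n, as in the PCA of Section 3.1
    sse = (((X - X_hat) / s) ** 2).sum() / n        # the column means cancel in the difference
    return sse / p                                  # total variance of standardized data is p

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))                        # synthetic data, not Table 1
X_hat = X + 0.5 * rng.normal(size=X.shape)          # synthetic estimates, not MCA output

print(round(stress(X, X_hat), 3))
print(stress(X, X))                                       # perfect estimates
print(stress(X, np.tile(X.mean(axis=0), (40, 1))))        # estimating every value by its column mean
```

Unlike the explained-variance percentage in (3), this quantity is not bounded above by 1, since the errors are not orthogonal to the estimates.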

4.4 Estimating original data: fuzzy MCA

In our application the fuzzy coding of the data is a 1-to-1 mapping, thanks to the use of triangular

membership functions, and the process can be reversed, that is, defuzzified back to the original data.

Using the same example as in the previous paragraphs, Adana’s original sunshine value of 7.55 can be

recovered from the fuzzy values as the centroid of the minimum, median and maximum values,

using its fuzzy values as weights: 7.55 = 0×4.14 + 0.634×7.10 + 0.366×8.33 (this is a direct result of

(1)). In the same way, we can obtain a corresponding estimate from the fuzzy approximation

[ 0.112 0.393 0.496 ]: 0.112×4.14 + 0.393×7.10 + 0.496×8.33, which is equal to 7.39. After

performing this defuzzification for all 200 data values, several interesting results emerge:

1. The defuzzified estimates do have the same (column) means as the original data, and these

means are equal to weighted averages, like those above, using the means of the columns of Z*

as weights (the column means of Z* are the same as those of Z). For example, the means of

the first three columns of Z* corresponding to temperature are 0.191, 0.613 and 0.196: