Green Acre 2007 Caco l

1/47

Correspondence Analysis and Related Methods

Michael Greenacre
Universitat Pompeu Fabra
[email protected]

www.globalsong.net www.econ.upf.es/~michael

[Figure: timeline with the years 1961, 1973, 1984, 1989, 1991, 1993, 1994, 1998, 1999, 2002, 2007]

First XLSTAT Users Conference

Paris, 7-8 June 2007


2/47


3/47

Jean-Paul Benzécri... creator of Correspondence Analysis


4/47

Correspondence analysis:

in which areas of research is it useful?

CA visualizes complex data, primarily data on categorical measurement scales, facilitating understanding and interpretation, a neglected aspect of statistical enquiry (cf. the usual modelling approach)

linguistics, textual analysis: word frequencies

sociology: cross-tabulations and large sets of categorical data from questionnaires; useful for qualitative research, visualization of case study data

ecology: species abundance data at several locations, often with explanatory variables

market research: perceptual mapping of brands/products, ...

archeology: large sparse data matrices

biology, geology, chemistry, psychology...


5/47

Correspondence Analysis (CA)

CA is a method of data visualization.

It applies in the first instance to a cross-tabulation (contingency table) but can be applied to many other data types after suitable recoding.

The results of CA are in the form of a map of points.

The points represent the rows and columns of the table; it is not the absolute values which are represented (as in principal component analysis, for example) but their relative values.

The positions of the points in the map tell you something about similarities between the rows, similarities between the columns and the association between rows and columns.


6/47

A simple example

312 respondents, all readers of a certain newspaper, cross-tabulated according to their education group and level of reading of the newspaper.

E1: some primary   E2: primary completed   E3: some secondary   E4: secondary completed   E5: some tertiary

C1: glance   C2: fairly thorough   C3: very thorough

      C1   C2   C3
E1     5    7    2
E2    18   46   20
E3    19   29   39
E4    12   40   49
E5     3    7   16

[Figure: CA map of the points E1-E5 and C1-C3. Principal inertias: 0.0704 (84.52 %) and 0.0129 (15.48 %).]


7/47

Three basic geometric concepts

profile: the coordinates (position) of the point

mass: the weight given to the point

distance: the measure of proximity between points


8/47

Four derived geometrical concepts

centroid: the weighted average position

inertia: the weighted sum of squared distances to the centroid, inertia $= \sum_i m_i d_i^2$, where $m_i$ is the mass of point $i$ and $d_i$ is its distance to the centroid

subspace: a space of reduced dimensionality within the space

projection: the closest point in the subspace


9/47

Profile

A profile is a set of relative frequencies, that is, a set of frequencies expressed relative to their total (often in percentage form).

Each row or each column of a table of frequencies defines a different profile.

It is these profiles which CA visualises as points in a map.

original data

        C1    C2    C3   Total
E1       5     7     2      14
E2      18    46    20      84
E3      19    29    39      87
E4      12    40    49     101
E5       3     7    16      26
Total   57   129   126     312

row profiles

        C1    C2    C3   Total
E1     .36   .50   .14       1
E2     .21   .55   .24       1
E3     .22   .33   .45       1
E4     .12   .40   .49       1
E5     .12   .27   .62       1

column profiles

        C1    C2    C3
E1     .09   .05   .02
E2     .32   .37   .16
E3     .33   .22   .31
E4     .21   .31   .39
E5     .05   .05   .13
Total    1     1     1
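As a minimal sketch of the profile definition, the row and column profiles of the readership table can be computed with NumPy. The variable names (`N`, `row_profiles`, `col_profiles`) are illustrative choices, not part of the slides.

```python
import numpy as np

# Readership table: rows E1..E5 (education group), columns C1..C3 (reading level)
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

# Row profiles: each row divided by its row total
row_profiles = N / N.sum(axis=1, keepdims=True)

# Column profiles: each column divided by its column total
col_profiles = N / N.sum(axis=0, keepdims=True)

print(np.round(row_profiles, 2))
print(np.round(col_profiles, 2))
```

Each row of `row_profiles` and each column of `col_profiles` sums to 1, which is why the profiles can be plotted as points in a triangle when there are three categories.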


10/47

Row profiles viewed in 3-d


11/47

Plotting profiles in profile space

(triangular coordinates)

E1 0.36 0.50 0.14


12/47

Weighted average (centroid)

average

The average is the point at which the two points are balanced.

weighted average

The situation is identical for multidimensional points...


13/47


(barycentric or weighted average principle)

E1 0.36 0.50 0.14


14/47



E2 0.21 0.55 0.24


15/47



E5 0.12 0.27 0.62


16/47

Masses of the profiles

original data, with row totals and masses

        C1    C2    C3   Total   Mass
E1       5     7     2      14   .045
E2      18    46    20      84   .269
E3      19    29    39      87   .279
E4      12    40    49     101   .324
E5       3     7    16      26   .083
Total   57   129   126     312      1

average row profile

      .183  .413  .404       1
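The link between masses and the weighted average can be checked numerically. The sketch below, with illustrative variable names, computes the row masses and shows that the mass-weighted average of the row profiles is the average row profile (.183, .413, .404).

```python
import numpy as np

# Readership table: rows E1..E5, columns C1..C3
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)
n = N.sum()  # 312 respondents

# Masses: row totals relative to the grand total
row_masses = N.sum(axis=1) / n

# Row profiles and their weighted average (centroid)
row_profiles = N / N.sum(axis=1, keepdims=True)
centroid = row_masses @ row_profiles

# The centroid equals the column totals relative to the grand total
col_masses = N.sum(axis=0) / n

print(np.round(row_masses, 3))
print(np.round(centroid, 3))
```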


17/47

Readership data

Counts, with row profiles in parentheses:

Education Group              C1            C2            C3       Total   Mass
Some primary         E1    5 (0.357)     7 (0.500)     2 (0.143)     14   0.045
Primary completed    E2   18 (0.214)    46 (0.548)    20 (0.238)     84   0.269
Some secondary       E3   19 (0.218)    29 (0.333)    39 (0.448)     87   0.279
Secondary completed  E4   12 (0.119)    40 (0.396)    49 (0.485)    101   0.324
Some tertiary        E5    3 (0.115)     7 (0.269)    16 (0.615)     26   0.083
Total                     57 (0.183)   129 (0.413)   126 (0.404)    312

C1: glance   C2: fairly thorough   C3: very thorough


18/47

Calculating chi-square

$\chi^2 = 12 \text{ similar terms} \ldots + \frac{(3 - 4.76)^2}{4.76} + \frac{(7 - 10.74)^2}{10.74} + \frac{(16 - 10.50)^2}{10.50} = 26.0$

Row E5 (Some tertiary), with row profiles in parentheses:

                         C1            C2            C3       Total   Mass
Observed frequency     3 (0.115)     7 (0.269)    16 (0.615)     26   0.083
Expected frequency       4.76         10.74         10.50
Total                 57 (0.183)   129 (0.413)   126 (0.404)    312

For example, expected frequency of (E5,C1): 0.183 x 26 = 4.76
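The chi-square statistic can be reproduced from the full readership table. This is a small sketch with illustrative variable names; note that the exact expected frequency for (E5, C1) is 26 x 57 / 312 = 4.75, while the slide's 4.76 comes from using the rounded average 0.183.

```python
import numpy as np

# Readership table: rows E1..E5, columns C1..C3
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

# Expected frequencies under independence: row total x column total / grand total
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()

# Chi-square: sum over all 15 cells of (observed - expected)^2 / expected
chi2 = ((N - expected) ** 2 / expected).sum()

print(np.round(expected[4], 2))  # expected frequencies for row E5
print(round(chi2, 1))
```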


19/47

Calculating chi-square

$\chi^2 = 12 \text{ similar terms} \ldots + 26 \left[ \frac{(3/26 - 4.76/26)^2}{4.76/26} + \frac{(7/26 - 10.74/26)^2}{10.74/26} + \frac{(16/26 - 10.50/26)^2}{10.50/26} \right]$

$\chi^2 / 312 = 12 \text{ similar terms} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$



20/47

Calculating inertia

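Inertia $= \chi^2 / 312 = \text{similar terms for first four rows} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$

Here 0.083 is the mass (of row E5) and the bracketed sum is the squared chi-square distance (between the profile of E5 and the average profile).

Inertia $= \sum \text{mass} \times (\text{chi-square distance})^2$

The chi-square distance is a weighted Euclidean distance: the squared differences $(0.115 - 0.183)^2$, $(0.269 - 0.413)^2$, $(0.615 - 0.404)^2$ are the EUCLIDEAN part, and the division by 0.183, 0.413, 0.404 is the WEIGHTED part.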
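The identity between the inertia and chi-square divided by the sample size can be verified directly. The following sketch (variable names are illustrative) computes the inertia as the mass-weighted sum of squared chi-square distances to the average profile and compares it with chi-square / 312.

```python
import numpy as np

# Readership table: rows E1..E5, columns C1..C3
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)
n = N.sum()

row_masses = N.sum(axis=1) / n
row_profiles = N / N.sum(axis=1, keepdims=True)
avg_profile = N.sum(axis=0) / n

# Squared chi-square distance of each row profile to the average profile
chi2_dist_sq = ((row_profiles - avg_profile) ** 2 / avg_profile).sum(axis=1)

# Inertia: sum over rows of mass x squared chi-square distance
inertia = (row_masses * chi2_dist_sq).sum()

# Same quantity obtained from the chi-square statistic
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - expected) ** 2 / expected).sum()

print(round(inertia, 4), round(chi2 / n, 4))
```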


21/47

How can we see chi-square distances?

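Inertia $= \chi^2 / 312 = \text{similar terms for first four rows} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$

where 0.083 is the mass (of row E5) and the bracketed sum is the squared chi-square distance (between the profile of E5 and the average profile), a weighted Euclidean distance.

The same squared distance can be written as an ordinary Euclidean one:

$\left( \frac{0.115}{\sqrt{0.183}} - \frac{0.183}{\sqrt{0.183}} \right)^2 + \left( \frac{0.269}{\sqrt{0.413}} - \frac{0.413}{\sqrt{0.413}} \right)^2 + \left( \frac{0.615}{\sqrt{0.404}} - \frac{0.404}{\sqrt{0.404}} \right)^2$

So the answer is to divide all profile elements by the square root of their averages.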
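A quick numerical check of this stretching argument, as a sketch with illustrative names: after dividing every profile element by the square root of the corresponding average, ordinary Euclidean distances reproduce the chi-square distances.

```python
import numpy as np

# Readership table: rows E1..E5, columns C1..C3
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

row_profiles = N / N.sum(axis=1, keepdims=True)
avg_profile = N.sum(axis=0) / N.sum()

# Chi-square distances: squared differences divided by the average profile
chi2_dist_sq = ((row_profiles - avg_profile) ** 2 / avg_profile).sum(axis=1)

# Stretched coordinates: divide each element by the square root of its average
stretched = row_profiles / np.sqrt(avg_profile)
stretched_avg = avg_profile / np.sqrt(avg_profile)

# Ordinary (Pythagorean) squared distances in the stretched space
euclid_sq = ((stretched - stretched_avg) ** 2).sum(axis=1)

print(np.round(chi2_dist_sq, 4))
print(np.round(euclid_sq, 4))
```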


22/47

Stretched row profiles viewed in 3-d chi-squared space

Pythagorean (ordinary Euclidean) distances in the stretched space are chi-square distances.

[Figure: profiles and vertices in the stretched space]


23/47

What CA does

centres the row and column profiles with respect to their average profiles, so that the origin represents the average.

re-defines the dimensions of the space in an ordered way: the first dimension explains the maximum amount of inertia possible in one dimension; the second adds the maximum amount to the first (hence the first two explain the maximum amount in two dimensions), and so on until all dimensions are explained.

decomposes the total inertia along the principal axes into principal inertias, usually expressed as % of the total.

so if we want a low-dimensional version, we just take the first (principal) dimensions

The row and column problem solutions are closely related: one can be obtained from the other; there are simple scaling factors along each dimension relating the two problems.
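One common way to carry out these steps is a singular value decomposition of the standardized residuals of the table. The sketch below is an assumption about the computation, not a description of what XLSTAT does internally; applied to the readership table it gives the principal inertias quoted on the maps (about 0.0704 and 0.0129).

```python
import numpy as np

# Readership table: rows E1..E5, columns C1..C3
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

P = N / N.sum()      # correspondence matrix (relative frequencies)
r = P.sum(axis=1)    # row masses
c = P.sum(axis=0)    # column masses

# Centre with respect to the averages and scale by the chi-square metric
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# SVD orders the dimensions by the amount of inertia they explain
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

principal_inertias = sv ** 2
total_inertia = principal_inertias.sum()
percentages = 100 * principal_inertias / total_inertia

print(np.round(principal_inertias, 4))
print(np.round(percentages, 2))
```

With three columns there are only two non-trivial dimensions, so the third principal inertia is zero up to rounding error and the two-dimensional map displays all of the inertia.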


24/47

Asymmetric Maps using XLSTAT

[Figure: two asymmetric maps of the readership data, each showing the points E1-E5 and C1-C3. Principal inertias on both maps: 0.07037 (84.5%) and 0.01289 (15.5%).]


25/47

Symmetric Map using XLSTAT

[Figure: symmetric map of the readership data. Education groups: primary incomplete, primary complete, secondary incomplete, secondary complete, some tertiary. Reading levels: glance, fairly thorough, very thorough. Principal inertias: 0.07037 (84.5%) and 0.01289 (15.5%).]


26/47

Asymmetric and symmetric maps

Asymmetric maps represent the rows and columns jointly in principal and standard coordinates; asymmetric maps are also biplots.

Because the principal coordinates can be much smaller than the standard coordinates, especially when $\lambda_k$ is small, the generally accepted way for the joint map is the symmetric map, where both rows and columns are in principal coordinates.

Symmetric maps are strictly speaking not biplots, but they are almost so (see Gabriel, Biometrika, 2002).
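The two coordinate systems can be sketched as follows, assuming the SVD-based computation of CA (an assumption, with illustrative variable names). Standard coordinates have weighted variance 1 on each axis; principal coordinates are the standard coordinates shrunk by the square root of the principal inertia $\lambda_k$, so they have weighted variance $\lambda_k$.

```python
import numpy as np

# Readership table: rows E1..E5, columns C1..C3
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

k = 2  # number of non-trivial dimensions for a 5 x 3 table

# Standard coordinates: weighted variance 1 along each axis
row_std = U[:, :k] / np.sqrt(r)[:, None]
col_std = Vt.T[:, :k] / np.sqrt(c)[:, None]

# Principal coordinates: standard coordinates scaled by the singular values
row_pri = row_std * sv[:k]
col_pri = col_std * sv[:k]

# Weighted sums of squares along each axis
row_std_inertia = (r[:, None] * row_std ** 2).sum(axis=0)  # equals 1
row_pri_inertia = (r[:, None] * row_pri ** 2).sum(axis=0)  # equals lambda_k

# Row asymmetric map: row points are weighted averages of the column vertices
row_from_vertices = (N / N.sum(axis=1, keepdims=True)) @ col_std

print(np.round(row_pri, 3))
print(np.round(col_std, 3))
```

A row asymmetric map plots `row_pri` with `col_std`; a symmetric map plots `row_pri` with `col_pri`. The signs of the axes are arbitrary, so a map produced this way may be a mirror image of the published one.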


27/47

Data set "product" (McFie et al.)

Companies  ProdQual  Innovatn  ProdRange  Environm  PriceLevel  ModImage  PriceSens  GlobProd
A                 3        16         14        13          14        18          6        18
B                 1        15          6         8          10        13         14         9
C                13        11          4        13          11         4         10         2
D                 9        11          4         9          11         9         11         3
E                 6        14         15        17          14        16          8        15
F                 3        16         14        15          12        14          7        16
G                18        12         13        16          13         5          4         7
H                 2        14          7         6          10         4         14         8
I                10        14         13        12          14        16          4         8
ours              4        15         15        16          14         7          6        15

Our company wishes to identify the perceptions of itself and its nine major competitors.

Data are gathered from representatives from 18 companies that represent their potential client base: each has to say which companies they associate with which of 8 attributes.

The aim is to gain an idea about the relationships between the competitors and the attributes, and where our company is situated in the overall scheme.


28/47

Reduction of dimensionality


29/47


data centred

means


30/47


data centred

points weighted (row masses)

in case of frequency data, points are weighted by

their row masses, that is the relative frequencies of

each row (i.e. proportional to sample sizes, n)


31/47


data centred

points weighted (row masses)

metric weighted (column weights)

$d_{ii'}^2 = \sum_j w_j (y_{ij} - y_{i'j})^2$

where $i$ and $i'$ are two points.

e.g. $w_j = 1/s_j^2$, the inverse of the variance, in PCA

$w_j = 1/c_j$, the inverse of the expected value, in CA
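The weighted metric can be written as one small function that covers both cases. This is a sketch: `weighted_sq_distance` is a hypothetical helper name, and the readership table is reused here only to have concrete numbers.

```python
import numpy as np

def weighted_sq_distance(y_i, y_k, w):
    """Squared weighted Euclidean distance: sum_j w_j * (y_ij - y_kj)^2."""
    y_i, y_k, w = (np.asarray(a, dtype=float) for a in (y_i, y_k, w))
    return float(np.sum(w * (y_i - y_k) ** 2))

# Readership table: rows E1..E5, columns C1..C3
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

# CA: points are row profiles, weights are 1 / average profile element
profiles = N / N.sum(axis=1, keepdims=True)
c = N.sum(axis=0) / N.sum()
d2_ca = weighted_sq_distance(profiles[0], profiles[4], 1 / c)

# PCA on standardized variables: weights are 1 / column variance
variances = N.var(axis=0, ddof=1)
d2_pca = weighted_sq_distance(N[0], N[4], 1 / variances)

# Unit weights give the ordinary squared Euclidean distance
d2_plain = weighted_sq_distance(profiles[0], profiles[4], np.ones(3))

print(round(d2_ca, 4), round(d2_pca, 4), round(d2_plain, 4))
```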


32/47

Fat Freddy's Cat Dimensional Transmogrifier

with thanks to Jörg Blasius


33/47


Products   ProdQual  Innovatn  ProdRange  Environm  PriceLevel  ModImage  PriceSens  GlobProd
A                 3        16         14        13          14        18          6        18
B                 1        15          6         8          10        13         14         9
C                13        11          4        13          11         4         10         2
D                 9        11          4         9          11         9         11         3
E                 6        14         15        17          14        16          8        15
F                 3        16         14        15          12        14          7        16
G                18        12         13        16          13         5          4         7
H                 2        14          7         6          10         4         14         8
I                10        14         13        12          14        16          4         8
ours              4        15         15        16          14         7          6        15

Our company wishes to identify the perceptions of its products and its 9 major competitors (A, B, ..., I).

Data are gathered from representatives from 18 companies that represent their potential client base: each has to say which products they associate with which of 8 attributes.

The aim is to gain an idea about the relationships between the competitors and the attributes, and where our company is situated in the overall scheme.


34/47


First note that this is NOT a contingency table, so the chi-square test is not applicable (a permutation test could test for significance, but then we need to have original respondent-level data).

This is an interesting example because it can be analyzed as is, or it can be recoded to bring out certain features.

Analyzing it with no recoding means that the size effect (sometimes called the halo effect) is removed, since we analyze profiles, i.e., the counts relative to their total counts. In other words, if a product gets relatively few associations, then it is the highest of these (lower) associations that are determinant. Hence, in the following extreme case, a pattern of [ 18 18 18 ] is identical to a pattern of [ 1 1 1 ]!

The masses assigned to the products will be proportional to the number of associations they get.

If the size effect needs to be visualized as well, the data table should be doubled.
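The extreme case can be checked in a few lines. In this sketch `profile` is a hypothetical helper name; it shows that the two patterns have the same profile and differ only in their totals, which is what the masses pick up.

```python
import numpy as np

def profile(counts):
    """Counts expressed relative to their total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

high = [18, 18, 18]  # associated with every attribute by all 18 respondents
low = [1, 1, 1]      # associated with every attribute by one respondent

# Identical profiles: the size (halo) effect has been removed
same_profile = np.allclose(profile(high), profile(low))

# Only the totals differ, and hence the masses given to the two points
totals = np.array([sum(high), sum(low)], dtype=float)
masses = totals / totals.sum()

print(same_profile, np.round(masses, 3))
```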



35/47

Data set "product" (McFie et al.)

Counts:

Products   PQ   In   PR   En   PL   MI   PS   GP   Total
A           3   16   14   13   14   18    6   18     102
B           1   15    6    8   10   13   14    9      76
C          13   11    4   13   11    4   10    2      68
D           9   11    4    9   11    9   11    3      67
E           6   14   15   17   14   16    8   15     105
F           3   16   14   15   12   14    7   16      97
G          18   12   13   16   13    5    4    7      88
H           2   14    7    6   10    4   14    8      65
I          10   14   13   12   14   16    4    8      91
ours        4   15   15   16   14    7    6   15      92

Row percentages:

Products     PQ     In     PR     En     PL     MI     PS     GP   Total
A           2.9   15.7   13.7   12.7   13.7   17.6    5.9   17.6     102
B           1.3   19.7    7.9   10.5   13.2   17.1   18.4   11.8      76
C          19.1   16.2    5.9   19.1   16.2    5.9   14.7    2.9      68
D          13.4   16.4    6.0   13.4   16.4   13.4   16.4    4.5      67
E           5.7   13.3   14.3   16.2   13.3   15.2    7.6   14.3     105
F           3.1   16.5   14.4   15.5   12.4   14.4    7.2   16.5      97
G          20.5   13.6   14.8   18.2   14.8    5.7    4.5    8.0      88
H           3.1   21.5   10.8    9.2   15.4    6.2   21.5   12.3      65
I          11.0   15.4   14.3   13.2   15.4   17.6    4.4    8.8      91
ours        4.3   16.3   16.3   17.4   15.2    7.6    6.5   16.3      92


36/47


Doubled table:

Prod.  PQ  PQ-  In  In-  PR  PR-  En  En-  PL  PL-  MI  MI-  PS  PS-  GP  GP-  Total
A       3   15  16    2  14    4  13    5  14    4  18    0   6   12  18    0    144
B       1   17  15    3   6   12   8   10  10    8  13    5  14    4   9    9    144
C      13    5  11    7   4   14  13    5  11    7   4   14  10    8   2   16    144
D       9    9  11    7   4   14   9    9  11    7   9    9  11    7   3   15    144
E       6   12  14    4  15    3  17    1  14    4  16    2   8   10  15    3    144
F       3   15  16    2  14    4  15    3  12    6  14    4   7   11  16    2    144
G      18    0  12    6  13    5  16    2  13    5   5   13   4   14   7   11    144
H       2   16  14    4   7   11   6   12  10    8   4   14  14    4   8   10    144
I      10    8  14    4  13    5  12    6  14    4  16    2   4   14   8   10    144
ours    4   14  15    3  15    3  16    2  14    4   7   11   6   12  15    3    144

Doubling involves coding the counts of the numbers (out of 18) that DON'T associate the product with the attribute in each case.

There are now two columns per attribute: each attribute is represented by its positive and negative end of the 0-to-18 scale of counts.
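Doubling is a simple recoding of the counts. The sketch below (variable names are illustrative) builds the doubled table from the product data, interleaving each attribute with its negative pole, 18 minus the count.

```python
import numpy as np

n_respondents = 18
products = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "ours"]
attributes = ["PQ", "In", "PR", "En", "PL", "MI", "PS", "GP"]

counts = np.array([[3, 16, 14, 13, 14, 18, 6, 18],
                   [1, 15, 6, 8, 10, 13, 14, 9],
                   [13, 11, 4, 13, 11, 4, 10, 2],
                   [9, 11, 4, 9, 11, 9, 11, 3],
                   [6, 14, 15, 17, 14, 16, 8, 15],
                   [3, 16, 14, 15, 12, 14, 7, 16],
                   [18, 12, 13, 16, 13, 5, 4, 7],
                   [2, 14, 7, 6, 10, 4, 14, 8],
                   [10, 14, 13, 12, 14, 16, 4, 8],
                   [4, 15, 15, 16, 14, 7, 6, 15]])

# Interleave positive poles (counts) and negative poles (18 - counts)
doubled = np.empty((counts.shape[0], 2 * counts.shape[1]), dtype=int)
doubled[:, 0::2] = counts
doubled[:, 1::2] = n_respondents - counts

labels = [name + suffix for name in attributes for suffix in ("", "-")]

# Every row now sums to 8 x 18 = 144, so all products get the same mass
row_totals = doubled.sum(axis=1)
masses = row_totals / row_totals.sum()

print(labels)
print(doubled[0], row_totals[0], masses[0])
```

Because every row total is 144, all ten products receive mass 0.1 in the doubled analysis, in contrast to the undoubled table where the masses are proportional to the number of associations.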


37/47

Row asymmetric map

[Figure: row asymmetric map of the product data, showing the products A-I and ours and the attributes ProdQual, Innovatn, ProdRange, Environm, PriceLevel, ModImage, PriceSens, GlobProd. Principal inertias: 0.0765 (53.1%) and 0.0478 (33.2%).]

Row points are projections of row profiles; they have inertias along axes equal to the principal inertias (hence principal coordinates).

Column points are projections of extreme corner profiles, or vertices (cf. triangle); they have inertia along axes equal to 1 (hence standard coordinates).

Profile points are generally close to the average.


38/47

Symmetric map

[Figure: symmetric map of the product data, showing the products A-I and ours and the attributes ProdQual, Innovatn, ProdRange, Environm, PriceLevel, ModImage, PriceSens, GlobProd. Principal inertias: 0.0765 (53.1%) and 0.0478 (33.2%).]

Row points and column points are both displayed in principal coordinates; both have inertias along axes equal to the principal inertias.

Both sets of points occupy similar regions of the map: aesthetically a better graphic.


39/47

Doubled data: symmetric map

[Figure: symmetric map of the doubled product data, showing the products A-I and ours and the attribute poles PQ/PQ-, In/In-, PR/PR-, En/En-, PL/PL-, MI/MI-, PS/PS-, GP/GP-. Principal inertias: 0.1173 (54.5%) and 0.0682 (31.7%). Regions annotated on the map: "High product quality"; "High price sensitive; low environment, product range and price level"; "High product range, modern image, global products".]

Attributes have a positive and a negative pole; the average association is at the origin of the map, e.g., In(novation) has a high average, P(roduct)Q(uality) has a low average.

Fairly similar configuration to the undoubled analysis: there is no strong halo effect.


40/47

Inertia contributions in CA

Correspondence analysis (CA) is a method of data visualization which represents the true positions of profile points in a map which comes closest to all the points, closest in the sense of weighted least-squares.

The inertia explained in the map applies to all the points: if we say 83% of the inertia is explained in the map, 71% on the first dimension and 12% on the second, this is a figure calculated for all row (or column) points together.

[Figure: cloud of points with the two map dimensions labelled 71% and 12%]


41/47

Inertia contributions in CA

This type of inertia-explained-by-axes calculation can be made for individual points.

These more detailed results are aids to interpretation in the form of numerical diagnostics, called contributions.

Especially when there is not a high percentage of inertia explained by the map, these contributions will help us to identify points which are represented inaccurately.

The inertias and their percentages tell us how much of the variance in the table is explained by the principal axes. The contributions do the same, but for each point individually, and help us to see:

(a) which points are being explained better than others;
(b) which points are contributing to the solution more than others.


42/47

Geometry of inertia contributions

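[Figure: the i-th point $a_i$ with mass $m_i$ lies at distance $d_i$ from the centroid $c$; its projection on the k-th principal axis has coordinate $f_{ik}$.]

Total inertia of the cloud of points $= \sum_i m_i d_i^2 = \sum_i m_i \sum_k f_{ik}^2 = \sum_k \lambda_k$

Inertia of i-th point $= m_i d_i^2 = m_i \sum_k f_{ik}^2$

Inertia contribution of i-th point to k-th axis $= m_i f_{ik}^2$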


43/47

Geometry of inertia contributions

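[Figure: the i-th point at distance $d_i$ from the centroid $c$; its projection on the k-th principal axis has coordinate $f_{ik}$.]

Total inertia of the cloud of points $= \sum_i m_i d_i^2 = \sum_i m_i \sum_k f_{ik}^2 = \sum_k \lambda_k$

Inertia of i-th point $= m_i d_i^2 = m_i \sum_k f_{ik}^2$

Inertia contribution of i-th point to k-th axis $= m_i f_{ik}^2$

Table of inertia contributions (rows: points 1 to n; columns: axes 1 to p; right margin: inertia of each point; bottom margin: inertia of each axis):

$$
\begin{array}{c|cccc|c}
 & 1 & 2 & \cdots & p & \\
\hline
1 & m_1 f_{11}^2 & m_1 f_{12}^2 & \cdots & m_1 f_{1p}^2 & m_1 d_1^2 \\
2 & m_2 f_{21}^2 & m_2 f_{22}^2 & \cdots & m_2 f_{2p}^2 & m_2 d_2^2 \\
3 & m_3 f_{31}^2 & m_3 f_{32}^2 & \cdots & m_3 f_{3p}^2 & m_3 d_3^2 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
n & m_n f_{n1}^2 & m_n f_{n2}^2 & \cdots & m_n f_{np}^2 & m_n d_n^2 \\
\hline
 & \lambda_1 & \lambda_2 & \cdots & \lambda_p &
\end{array}
$$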


44/47

Inertia contributions

[Figure: the i-th point at distance $d_i$ from the centroid $c$, its projection $f_{ik}$ on the k-th principal axis, and the angle $\theta_{ik}$ between the point and the axis.]

In the table of contributions $m_i f_{ik}^2$ (rows: points 1 to n; columns: axes 1 to p; row sums $m_i d_i^2$; column sums $\lambda_k$):

$m_i f_{ik}^2 / \lambda_k$: amount of inertia of axis k explained by point i (absolute contribution, CTR)

$m_i f_{ik}^2 / (m_i d_i^2)$: amount of inertia of point i explained by axis k (relative contribution, COR)

$m_i f_{ik}^2 / (m_i d_i^2) = f_{ik}^2 / d_i^2$, i.e. the square of $f_{ik} / d_i = \cos(\theta_{ik})$, where $\theta_{ik}$ is the angle between point i and axis k
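Both kinds of contribution follow directly from the masses and principal coordinates. The sketch below uses the small readership table (5 education groups by 3 reading levels) rather than the doubled product data, and assumes the SVD-based computation of CA; variable names are illustrative.

```python
import numpy as np

# Readership table: rows E1..E5, columns C1..C3
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

k = 2                    # all non-trivial dimensions of a 5 x 3 table
lam = sv[:k] ** 2        # principal inertias lambda_k
F = (U[:, :k] / np.sqrt(r)[:, None]) * sv[:k]  # row principal coordinates f_ik

# Squared chi-square distance of each row profile to the centroid, d_i^2
row_profiles = N / N.sum(axis=1, keepdims=True)
d2 = ((row_profiles - c) ** 2 / c).sum(axis=1)

# Table of inertia contributions m_i * f_ik^2
contrib = r[:, None] * F ** 2

# CTR: share of each axis's inertia due to each point (columns sum to 1)
ctr = contrib / lam

# COR: share of each point's inertia on each axis, i.e. squared cosines
cor = contrib / (r * d2)[:, None]

print(np.round(ctr, 3))
print(np.round(cor, 3))
```

Here the squared cosines of each point sum to 1 because two dimensions display the table exactly; in the product example, where the map shows about 86% of the inertia, low row sums flag points that are not well represented.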


45/47

Contributions to axes and contributions to points (product data, doubled)

Contributions (rows):

        Weight (relative)      F1      F2
A                   0.100   0.200   0.010
B                   0.100   0.006   0.266
C                   0.100   0.249   0.031
D                   0.100   0.153   0.011
E                   0.100   0.113   0.010
F                   0.100   0.113   0.004
G                   0.100   0.037   0.414
H                   0.100   0.074   0.202
I                   0.100   0.009   0.044
ours                0.100   0.048   0.010

Squared cosines (rows):

           F1      F2
A       0.922   0.027
B       0.033   0.914
C       0.901   0.065
D       0.856   0.035
E       0.827   0.045
F       0.929   0.017
G       0.129   0.839
H       0.320   0.510
I       0.087   0.259
ours    0.389   0.046

Eigenvalues and percentages of inertia:

                    F1       F2
Eigenvalue       0.117    0.068
Rows depend     54.482   31.656
Cumulative %    54.482   86.139

Not so well-represented


46/47

CARME 2007

Correspondence Analysis &

Related Methods

Erasmus University

Rotterdam

25-27 June 2007

http://www.carme-n.org

After:

Correspondence Analysis in the Social Sciences (Cologne, 1991)

Visualizing Categorical Data (Cologne, 1995)

Large Scale Data Analysis (Cologne, 1999)

Correspondence Analysis and Related Methods (CARME 2003) (Barcelona, 2003)


47/47

Just published by Chapman & Hall / CRC Press
