Green Acre 2007 Caco l
Transcript of Green Acre 2007 Caco l
-
7/21/2019 Green Acre 2007 Caco l
1/47
Correspondence Analysis and Related Methods
Michael Greenacre
Universitat Pompeu Fabra
www.globalsong.net www.econ.upf.es/~michael
[Figure: items labelled with the years 1961, 1973, 1984, 1989, 1991, 1993, 1994, 1998, 1999, 2002 and 2007]
First XLSTAT Users Conference
Paris, 7-8 June 2007
-
Jean-Paul Benzécri... creator of Correspondence Analysis
-
Correspondence analysis: in which areas of research is it useful?
CA visualizes complex data, primarily data on categorical measurement scales, facilitating understanding and interpretation, a neglected aspect of statistical enquiry (cf. the usual modelling approach).
linguistics, textual analysis: word frequencies
sociology: cross-tabulations and large sets of categorical data from questionnaires; useful for qualitative research and visualization of case study data
ecology: species abundance data at several locations, often with explanatory variables
market research: perceptual mapping of brands/products, ...
archeology: large sparse data matrices
biology, geology, chemistry, psychology, ...
-
Correspondence Analysis (CA)
CA is a method of data visualization.
[Figure: a cloud of points]
It applies in the first instance to a cross-tabulation (contingency table) but can be applied to many other data types after suitable recoding.
The results of CA are in the form of a map of points.
The points represent the rows and columns of the table; it is not the absolute values which are represented (as in principal component analysis, for example) but their relative values.
The positions of the points in the map tell you something about similarities between the rows, similarities between the columns and the association between rows and columns.
-
A simple example
312 respondents, all readers of a certain newspaper, cross-tabulated according to their education group and level of reading of the newspaper.

       C1    C2    C3
E1      5     7     2
E2     18    46    20
E3     19    29    39
E4     12    40    49
E5      3     7    16

E1: some primary   E2: primary completed   E3: some secondary   E4: secondary completed   E5: some tertiary
C1: glance   C2: fairly thorough   C3: very thorough

[Figure: CA map of the education groups E1 to E5 and the reading levels C1 to C3. First axis: 0.0704 (84.52 %); second axis: 0.0129 (15.48 %)]
-
Three basic geometric concepts
profile, mass, distance
profile: the coordinates (position) of the point
mass: the weight given to the point
distance: the measure of proximity between points
-
Four derived geometrical concepts
inertia: the weighted sum of squared distances to the centroid
centroid: the weighted average position
projection: the closest point in the subspace
subspace: a space of reduced dimensionality within the space
[Figure: a point with mass $m_i$ at distance $d_i$ from the centroid, and its projection onto a subspace]
inertia = $\sum_i m_i d_i^2$
-
Profile
A profile is a set of relative frequencies, that is, a set of frequencies expressed relative to their total (often in percentage form).
Each row or each column of a table of frequencies defines a different profile.
It is these profiles which CA visualises as points in a map.

Original data
         C1    C2    C3   Total
E1        5     7     2      14
E2       18    46    20      84
E3       19    29    39      87
E4       12    40    49     101
E5        3     7    16      26
Total    57   129   126     312

Row profiles
         C1    C2    C3   Total
E1      .36   .50   .14       1
E2      .21   .55   .24       1
E3      .22   .33   .45       1
E4      .12   .40   .49       1
E5      .12   .27   .62       1

Column profiles
         C1    C2    C3
E1      .09   .05   .02
E2      .32   .37   .16
E3      .33   .22   .31
E4      .21   .31   .39
E5      .05   .05   .13
Total     1     1     1
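As a minimal sketch of how the row and column profiles above can be computed, the following uses only the readership counts from the slides. The names `TABLE`, `row_profiles` and `column_profiles` are illustrative and do not come from the presentation.

```python
# Readership counts from the slides: rows E1..E5, columns C1..C3.
TABLE = {
    "E1": [5, 7, 2],
    "E2": [18, 46, 20],
    "E3": [19, 29, 39],
    "E4": [12, 40, 49],
    "E5": [3, 7, 16],
}


def row_profiles(table):
    """Divide each row by its own total, so every row sums to 1."""
    return {name: [x / sum(row) for x in row] for name, row in table.items()}


def column_profiles(table):
    """Divide each column by its own total, so every column sums to 1."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {
        name: [x / t for x, t in zip(row, col_totals)]
        for name, row in table.items()
    }


for name, prof in row_profiles(TABLE).items():
    print(name, [round(p, 2) for p in prof])
```

Some printed values may differ from the slide in the last digit, since the slide appears to adjust roundings so that each profile sums exactly to 1.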
-
Row profiles viewed in 3-d
-
Plotting profiles in profile space (triangular coordinates)
E1: 0.36 0.50 0.14
[Figure: E1 located in the triangle by its coordinates 0.36, 0.50 and 0.14]
-
Weighted average (centroid)
average
The average is the point at which the two points are balanced.
weighted average
The situation is identical for multidimensional points...
-
Plotting profiles in profile space
(barycentric or weighted average principle)
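E1: 0.36 0.50 0.14
[Figure: E1 positioned in the triangle as the weighted average of the three vertices, with weights 0.36, 0.50 and 0.14]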
-
Plotting profiles in profile space
(barycentric or weighted average principle)
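E2: 0.21 0.55 0.24
[Figure: E2 positioned in the triangle as the weighted average of the three vertices, with weights 0.21, 0.55 and 0.24]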
-
Plotting profiles in profile space
(barycentric or weighted average principle)
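E5: 0.12 0.27 0.62
[Figure: E5 positioned in the triangle as the weighted average of the three vertices, with weights 0.12, 0.27 and 0.62]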
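The barycentric principle can be sketched in a few lines: a profile is placed at the weighted average of the triangle's vertices, using the profile elements as weights. The vertex positions below (an equilateral triangle with C1, C2 and C3 at its corners) are an assumption made for illustration, as is the function name `barycentric_point`.

```python
import math

# Assumed layout: an equilateral triangle with one vertex per reading category.
VERTICES = {
    "C1": (0.0, 0.0),
    "C2": (1.0, 0.0),
    "C3": (0.5, math.sqrt(3) / 2),
}


def barycentric_point(profile, vertices=VERTICES):
    """Return the weighted average of the vertices, weighted by the profile."""
    points = list(vertices.values())
    x = sum(w * px for w, (px, _) in zip(profile, points))
    y = sum(w * py for w, (_, py) in zip(profile, points))
    return x, y


# Profiles as printed on the slides (rounded to two decimals).
e1 = barycentric_point([0.36, 0.50, 0.14])
e5 = barycentric_point([0.12, 0.27, 0.62])
print("E1 at", e1)
print("E5 at", e5)
```

A profile concentrated entirely in one category, such as (1, 0, 0), lands exactly on the corresponding vertex, which is why the vertices represent the most extreme possible profiles.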
-
Masses of the profiles

Original data, with row masses
         C1    C2    C3   Total   Mass
E1        5     7     2      14   .045
E2       18    46    20      84   .269
E3       19    29    39      87   .279
E4       12    40    49     101   .324
E5        3     7    16      26   .083
Total    57   129   126     312      1

Average row profile
       .183  .413  .404       1
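The masses and the average row profile can be reproduced with a short sketch. It also checks numerically that the average row profile is the centroid of the row profiles when each profile is weighted by its mass. Variable names are illustrative only.

```python
# Readership counts from the slides: rows E1..E5, columns C1..C3.
TABLE = {
    "E1": [5, 7, 2],
    "E2": [18, 46, 20],
    "E3": [19, 29, 39],
    "E4": [12, 40, 49],
    "E5": [3, 7, 16],
}

n = sum(sum(row) for row in TABLE.values())  # grand total, 312

# Row masses: row totals relative to the grand total.
masses = {name: sum(row) / n for name, row in TABLE.items()}

# Row profiles: each row relative to its own total.
profiles = {name: [x / sum(row) for x in row] for name, row in TABLE.items()}

# Average row profile computed directly from the column totals.
average_profile = [sum(col) / n for col in zip(*TABLE.values())]

# The same vector obtained as the mass-weighted average (centroid) of profiles.
centroid = [
    sum(masses[name] * profiles[name][j] for name in TABLE) for j in range(3)
]

print({k: round(v, 3) for k, v in masses.items()})
print([round(v, 3) for v in centroid])
```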
-
Readership data
Counts, with row profile values in parentheses.

Education group              C1            C2            C3          Total   Mass
E1  Some primary              5 (0.357)     7 (0.500)     2 (0.143)     14   0.045
E2  Primary completed        18 (0.214)    46 (0.548)    20 (0.238)     84   0.269
E3  Some secondary           19 (0.218)    29 (0.333)    39 (0.448)     87   0.279
E4  Secondary completed      12 (0.119)    40 (0.396)    49 (0.485)    101   0.324
E5  Some tertiary             3 (0.115)     7 (0.269)    16 (0.615)     26   0.083
Total                        57 (0.183)   129 (0.413)   126 (0.404)    312

C1: glance   C2: fairly thorough   C3: very thorough
-
Calculating chi-square
$\chi^2 = \text{12 similar terms} \ldots + \frac{(3 - 4.76)^2}{4.76} + \frac{(7 - 10.74)^2}{10.74} + \frac{(16 - 10.50)^2}{10.50} = 26.0$

Row E5 (some tertiary), observed and expected frequencies:
                        C1        C2        C3      Total   Mass
Observed frequency       3         7        16         26   0.083
  (row profile)       (0.115)   (0.269)   (0.615)
Expected frequency     4.76     10.74     10.50
Total                   57       129       126        312
  (average profile)   (0.183)   (0.413)   (0.404)

For example, the expected frequency of (E5, C1) is 0.183 x 26 = 4.76.
-
Calculating chi-square
$\chi^2 = \text{12 similar terms} \ldots + 26 \left[ \frac{(3/26 - 4.76/26)^2}{4.76/26} + \frac{(7/26 - 10.74/26)^2}{10.74/26} + \frac{(16/26 - 10.50/26)^2}{10.50/26} \right]$
$\chi^2 / 312 = \text{12 similar terms} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$
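The chi-square calculation on these two slides can be sketched as follows, summing (observed - expected)^2 / expected over all 15 cells. The function name `chi_square` is illustrative. Note that the exact expected frequency for (E5, C1) is 26 x 57 / 312 = 4.75; the slide's 4.76 comes from using the rounded proportion 0.183.

```python
# Readership counts from the slides: rows E1..E5, columns C1..C3.
TABLE = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]


def chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


chi2 = chi_square(TABLE)
print(round(chi2, 1))        # chi-square statistic
print(round(chi2 / 312, 4))  # chi-square divided by the grand total
```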
-
Calculating inertia
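Inertia = $\chi^2 / 312 = \text{similar terms for first four rows} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$
Here 0.083 is the mass of row E5, and the term in brackets is the squared chi-square distance between the profile of E5 and the average profile.
Inertia = $\sum$ mass $\times$ (chi-square distance)$^2$
The numerators $(0.115 - 0.183)^2 + (0.269 - 0.413)^2 + (0.615 - 0.404)^2$ form a squared EUCLIDEAN distance; dividing the terms by 0.183, 0.413 and 0.404 respectively makes it WEIGHTED.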
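The decomposition of inertia into mass times squared chi-square distance can be checked numerically with a small sketch. The helper name `inertia_terms` is an assumption for illustration; the data are the readership counts from the slides.

```python
# Readership counts from the slides: rows E1..E5, columns C1..C3.
TABLE = {
    "E1": [5, 7, 2],
    "E2": [18, 46, 20],
    "E3": [19, 29, 39],
    "E4": [12, 40, 49],
    "E5": [3, 7, 16],
}


def inertia_terms(table):
    """Return {row: (mass, squared chi-square distance to the average profile)}."""
    n = sum(sum(row) for row in table.values())
    average = [sum(col) / n for col in zip(*table.values())]
    terms = {}
    for name, row in table.items():
        total = sum(row)
        mass = total / n
        profile = [x / total for x in row]
        dist_sq = sum((p - c) ** 2 / c for p, c in zip(profile, average))
        terms[name] = (mass, dist_sq)
    return terms


terms = inertia_terms(TABLE)
total_inertia = sum(mass * dist_sq for mass, dist_sq in terms.values())
for name, (mass, dist_sq) in terms.items():
    print(name, round(mass, 3), round(dist_sq, 3), round(mass * dist_sq, 4))
print("total inertia:", round(total_inertia, 4))
```

The printed total should agree with the chi-square statistic divided by the grand total of 312.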
-
How can we see chi-square distances?
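Inertia = $\chi^2 / 312 = \text{similar terms for first four rows} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$
Here 0.083 is the mass of row E5, and the term in brackets is the squared chi-square distance between the profile of E5 and the average profile.
The weighted Euclidean form
$\frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404}$
can be rewritten as an ordinary Euclidean one:
$\left( \frac{0.115}{\sqrt{0.183}} - \frac{0.183}{\sqrt{0.183}} \right)^2 + \left( \frac{0.269}{\sqrt{0.413}} - \frac{0.413}{\sqrt{0.413}} \right)^2 + \left( \frac{0.615}{\sqrt{0.404}} - \frac{0.404}{\sqrt{0.404}} \right)^2$
So the answer is to divide all profile elements by the square root of their averages.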
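The equivalence between the chi-square distance and the ordinary Euclidean distance between "stretched" profiles can be verified with a short sketch. The function names are illustrative; the profile of E5 and the average profile are taken from the readership data.

```python
import math

AVERAGE = [57 / 312, 129 / 312, 126 / 312]  # average row profile
E5 = [3 / 26, 7 / 26, 16 / 26]              # row profile of E5


def chi_square_distance_sq(p, q, average):
    """Squared chi-square distance: squared differences weighted by 1/average."""
    return sum((a - b) ** 2 / c for a, b, c in zip(p, q, average))


def stretch(profile, average):
    """Divide each profile element by the square root of its average."""
    return [p / math.sqrt(c) for p, c in zip(profile, average)]


def euclidean_distance_sq(u, v):
    """Ordinary (Pythagorean) squared distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


weighted = chi_square_distance_sq(E5, AVERAGE, AVERAGE)
stretched = euclidean_distance_sq(stretch(E5, AVERAGE), stretch(AVERAGE, AVERAGE))
print(round(weighted, 4), round(stretched, 4))
```

Both numbers are the same quantity computed in two ways, which is why plotting the stretched profiles makes chi-square distances visible as ordinary distances.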
-
Stretched row profiles viewed in 3-d chi-squared space
[Figure: stretched profiles and vertices in three dimensions; Pythagorean (ordinary Euclidean) distances in this space are chi-square distances]
-
What CA does
centres the row and column profiles with respect to their average profiles, so that the origin represents the average.
re-defines the dimensions of the space in an ordered way: the first dimension explains the maximum amount of inertia possible in one dimension; the second adds the maximum amount to the first (hence the first two explain the maximum amount in two dimensions), and so on until all dimensions are explained.
decomposes the total inertia along the principal axes into principal inertias, usually expressed as % of the total.
so if we want a low-dimensional version, we just take the first (principal) dimensions.
The row and column problem solutions are closely related: one can be obtained from the other, and there are simple scaling factors along each dimension relating the two problems.
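As a rough sketch of the decomposition into principal inertias, the following computes them for the readership table without a linear algebra library. It relies on an assumption that holds only for a three-column table: the 3x3 cross-product matrix of standardized residuals has one zero eigenvalue, so the other two follow from its trace and the sum of its 2x2 principal minors. In practice an SVD routine would be used; the function names here are illustrative.

```python
import math

# Readership counts from the slides: rows E1..E5, columns C1..C3.
TABLE = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]


def standardized_residuals(table):
    """S[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j), centred on the averages."""
    n = sum(map(sum, table))
    r = [sum(row) / n for row in table]
    c = [sum(col) / n for col in zip(*table)]
    return [
        [(x / n - ri * cj) / math.sqrt(ri * cj) for x, cj in zip(row, c)]
        for row, ri in zip(table, r)
    ]


def principal_inertias_three_columns(table):
    """Two non-zero eigenvalues of S'S for a table with exactly 3 columns."""
    s = standardized_residuals(table)
    a = [[sum(row[j] * row[k] for row in s) for k in range(3)] for j in range(3)]
    trace = a[0][0] + a[1][1] + a[2][2]  # equals the total inertia
    minors = (
        a[0][0] * a[1][1] - a[0][1] ** 2
        + a[0][0] * a[2][2] - a[0][2] ** 2
        + a[1][1] * a[2][2] - a[1][2] ** 2
    )  # equals the product of the two non-zero eigenvalues
    root = math.sqrt(max(trace * trace - 4 * minors, 0.0))
    return (trace + root) / 2, (trace - root) / 2


first, second = principal_inertias_three_columns(TABLE)
total = first + second
print(round(first, 4), round(100 * first / total, 1))
print(round(second, 4), round(100 * second / total, 1))
```

The two values can be compared with the axis labels of the maps for this example, which report the principal inertias and their percentages of the total.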
-
Asymmetric Maps using XLSTAT
[Figure: two asymmetric maps of the education groups E1 to E5 and the reading levels C1 to C3. In both maps, first axis: .07037 (84.5%); second axis: .01289 (15.5%)]
-
Symmetric Map using XLSTAT
[Figure: symmetric map of the education groups (primary incomplete, primary complete, secondary incomplete, secondary complete, some tertiary) and the reading levels (glance, fairly thorough, very thorough). First axis: .07037 (84.5%); second axis: .01289 (15.5%)]
-
Asymmetric and symmetric maps
Asymmetric maps represent the rows and columns jointly in principal and standard coordinates; asymmetric maps are also biplots.
Because the principal coordinates can be much smaller than the standard coordinates, especially when $\lambda_k$ is small, the generally accepted way for the joint map is the symmetric map, where both rows and columns are in principal coordinates.
Symmetric maps are strictly speaking not biplots, but they are almost so (see Gabriel, Biometrika, 2002).
-
Data set "product" (McFie et al.)

Companies  ProdQual  Innovatn  ProdRange  Environm  PriceLevel  ModImage  PriceSens  GlobProd
A                 3        16         14        13          14        18          6        18
B                 1        15          6         8          10        13         14         9
C                13        11          4        13          11         4         10         2
D                 9        11          4         9          11         9         11         3
E                 6        14         15        17          14        16          8        15
F                 3        16         14        15          12        14          7        16
G                18        12         13        16          13         5          4         7
H                 2        14          7         6          10         4         14         8
I                10        14         13        12          14        16          4         8
ours              4        15         15        16          14         7          6        15

Our company wishes to identify the perceptions of itself and its nine major competitors.
Data are gathered from representatives from 18 companies that represent their potential client base: each has to say which companies they associate with which of 8 attributes.
The aim is to gain an idea about the relationships between the competitors and the attributes, and where our company is situated in the overall scheme.
-
Reduction of dimensionality
-
Reduction of dimensionality
data centred
means
-
Reduction of dimensionality
data centred
points weighted (row masses)
in case of frequency data, points are weighted by
their row masses, that is the relative frequencies of
each row (i.e. proportional to sample sizes, n)
-
Reduction of dimensionality
data centred
points weighted (row masses)
metric weighted (column weights)
$d_{ii'}^2 = \sum_j w_j (y_{ij} - y_{i'j})^2$
[Figure: two points $i$ and $i'$ and the distance between them]
e.g. $w_j = 1/s_j^2$, the inverse of the variance in PCA
$w_j = 1/c_j$, the inverse of the expected value in CA
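The weighted metric can be sketched as a single function in which only the choice of column weights distinguishes the PCA and CA cases. The function names and the toy numbers in the first call are assumptions for illustration; the CA example uses the readership profiles of E1 and E5.

```python
def weighted_distance_sq(y1, y2, weights):
    """Squared distance between two rows with one weight per column."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(y1, y2, weights))


def pca_weights(variances):
    """Standardizing weights: inverse of each column's variance."""
    return [1.0 / v for v in variances]


def ca_weights(average_profile):
    """Chi-square weights: inverse of each element of the average profile."""
    return [1.0 / c for c in average_profile]


# Toy check with made-up numbers: 0.5 * (1 - 3)^2 + 2 * (2 - 5)^2.
toy = weighted_distance_sq([1, 2], [3, 5], [0.5, 2])

# Chi-square distance between the row profiles of E1 and E5.
average = [57 / 312, 129 / 312, 126 / 312]
e1 = [5 / 14, 7 / 14, 2 / 14]
e5 = [3 / 26, 7 / 26, 16 / 26]
d_sq = weighted_distance_sq(e1, e5, ca_weights(average))
print(toy, round(d_sq, 3))
```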
-
Fat Freddy's Cat Dimensional Transmogrifier
with thanks to Jörg Blasius
-
Data set "product" (McFie et al.)

Companies  ProdQual  Innovatn  ProdRange  Environm  PriceLevel  ModImage  PriceSens  GlobProd
A                 3        16         14        13          14        18          6        18
B                 1        15          6         8          10        13         14         9
C                13        11          4        13          11         4         10         2
D                 9        11          4         9          11         9         11         3
E                 6        14         15        17          14        16          8        15
F                 3        16         14        15          12        14          7        16
G                18        12         13        16          13         5          4         7
H                 2        14          7         6          10         4         14         8
I                10        14         13        12          14        16          4         8
ours              4        15         15        16          14         7          6        15

Our company wishes to identify the perceptions of its products and its 9 major competitors (A, B, ..., I).
Data are gathered from representatives from 18 companies that represent their potential client base: each has to say which products they associate with which of 8 attributes.
The aim is to gain an idea about the relationships between the competitors and the attributes, and where our company is situated in the overall scheme.
-
Data set "product" (McFie et al.)
First note that this is NOT a contingency table, so the chi-square test is not applicable (a permutation test could test for significance, but then we would need the original respondent-level data).
This is an interesting example because it can be analyzed as is, or it can be recoded to bring out certain features.
Analyzing it with no recoding means that the size effect (sometimes called the halo effect) is removed, since we analyze profiles, i.e., the counts relative to their total counts. In other words, if a product gets relatively few associations, then it is the highest of these (lower) associations that are determinant. Hence, in the following extreme case, a pattern of [ 18 18 18 ] is identical to a pattern of [ 1 1 1 ]!
The masses assigned to the products will be proportional to the number of associations they get.
If the size effect needs to be visualized as well, the data table should be doubled.
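The removal of the size effect can be illustrated with a tiny sketch: once counts are converted to profiles, the two extreme patterns mentioned above become indistinguishable, and only the masses still reflect how many associations a product received. The function name `profile` is illustrative.

```python
def profile(counts):
    """Counts relative to their own total."""
    total = sum(counts)
    return [x / total for x in counts]


high = [18, 18, 18]  # associated with every attribute by all 18 respondents
low = [1, 1, 1]      # associated with every attribute by a single respondent

print(profile(high))
print(profile(low))
print(profile(high) == profile(low))

# Masses remain proportional to the number of associations received.
totals = {"high": sum(high), "low": sum(low)}
grand_total = sum(totals.values())
masses = {name: t / grand_total for name, t in totals.items()}
print(masses)
```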
-
Data set "product" (McFie et al.)

Counts
Company    PQ    In    PR    En    PL    MI    PS    GP   Total
A           3    16    14    13    14    18     6    18     102
B           1    15     6     8    10    13    14     9      76
C          13    11     4    13    11     4    10     2      68
D           9    11     4     9    11     9    11     3      67
E           6    14    15    17    14    16     8    15     105
F           3    16    14    15    12    14     7    16      97
G          18    12    13    16    13     5     4     7      88
H           2    14     7     6    10     4    14     8      65
I          10    14    13    12    14    16     4     8      91
ours        4    15    15    16    14     7     6    15      92

Row percentages
Company    PQ    In    PR    En    PL    MI    PS    GP   Total
A         2.9  15.7  13.7  12.7  13.7  17.6   5.9  17.6     102
B         1.3  19.7   7.9  10.5  13.2  17.1  18.4  11.8      76
C        19.1  16.2   5.9  19.1  16.2   5.9  14.7   2.9      68
D        13.4  16.4   6.0  13.4  16.4  13.4  16.4   4.5      67
E         5.7  13.3  14.3  16.2  13.3  15.2   7.6  14.3     105
F         3.1  16.5  14.4  15.5  12.4  14.4   7.2  16.5      97
G        20.5  13.6  14.8  18.2  14.8   5.7   4.5   8.0      88
H         3.1  21.5  10.8   9.2  15.4   6.2  21.5  12.3      65
I        11.0  15.4  14.3  13.2  15.4  17.6   4.4   8.8      91
ours      4.3  16.3  16.3  17.4  15.2   7.6   6.5  16.3      92
-
Data set "product" (McFie et al.)
Doubling involves coding the counts of the numbers (out of 18) that DON'T associate the product with the attribute in each case.
There are now two columns per attribute: each attribute is represented by its positive and negative end of the 0-to-18 scale of counts.

Doubled table:
Company   PQ  PQ-   In  In-   PR  PR-   En  En-   PL  PL-   MI  MI-   PS  PS-   GP  GP-   Total
A          3   15   16    2   14    4   13    5   14    4   18    0    6   12   18    0     144
B          1   17   15    3    6   12    8   10   10    8   13    5   14    4    9    9     144
C         13    5   11    7    4   14   13    5   11    7    4   14   10    8    2   16     144
D          9    9   11    7    4   14    9    9   11    7    9    9   11    7    3   15     144
E          6   12   14    4   15    3   17    1   14    4   16    2    8   10   15    3     144
F          3   15   16    2   14    4   15    3   12    6   14    4    7   11   16    2     144
G         18    0   12    6   13    5   16    2   13    5    5   13    4   14    7   11     144
H          2   16   14    4    7   11    6   12   10    8    4   14   14    4    8   10     144
I         10    8   14    4   13    5   12    6   14    4   16    2    4   14    8   10     144
ours       4   14   15    3   15    3   16    2   14    4    7   11    6   12   15    3     144
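The doubling step can be sketched as a small function that pairs each count with its complement out of 18. The names `COUNTS`, `double_row` and `SCALE_MAX` are illustrative; the counts are those of the product data set.

```python
ATTRIBUTES = ["PQ", "In", "PR", "En", "PL", "MI", "PS", "GP"]
SCALE_MAX = 18  # number of respondents, i.e. the maximum possible count

COUNTS = {
    "A": [3, 16, 14, 13, 14, 18, 6, 18],
    "B": [1, 15, 6, 8, 10, 13, 14, 9],
    "C": [13, 11, 4, 13, 11, 4, 10, 2],
    "D": [9, 11, 4, 9, 11, 9, 11, 3],
    "E": [6, 14, 15, 17, 14, 16, 8, 15],
    "F": [3, 16, 14, 15, 12, 14, 7, 16],
    "G": [18, 12, 13, 16, 13, 5, 4, 7],
    "H": [2, 14, 7, 6, 10, 4, 14, 8],
    "I": [10, 14, 13, 12, 14, 16, 4, 8],
    "ours": [4, 15, 15, 16, 14, 7, 6, 15],
}


def double_row(counts, scale_max=SCALE_MAX):
    """Interleave each count with the number who did NOT make the association."""
    doubled = []
    for x in counts:
        doubled.extend([x, scale_max - x])
    return doubled


# Column labels: positive pole followed by negative pole for each attribute.
labels = [label for a in ATTRIBUTES for label in (a, a + "-")]
doubled_table = {name: double_row(row) for name, row in COUNTS.items()}
print(labels)
print("A", doubled_table["A"], sum(doubled_table["A"]))
```

Because every doubled row sums to 8 x 18 = 144, all rows receive the same mass, so differences in the overall level of association are no longer absorbed by the masses and can appear in the map.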
-
Row asymmetric map
[Figure: row asymmetric map of the companies (A to I and ours) and the attributes (ProdQual, Innovatn, ProdRange, Environm, PriceLevel, ModImage, PriceSens, GlobProd). First axis: 0.0765 (53.1%); second axis: 0.0478 (33.2%)]
Row points are projections of row profiles and have inertias along axes equal to the principal inertias (hence principal coordinates).
Column points are projections of extreme corner profiles, or vertices (cf. triangle), and have inertia along axes equal to 1 (hence standard coordinates).
Profile points are generally close to the average.
-
Symmetric map
[Figure: symmetric map of the companies (A to I and ours) and the attributes (ProdQual, Innovatn, ProdRange, Environm, PriceLevel, ModImage, PriceSens, GlobProd). First axis: 0.0765 (53.1%); second axis: 0.0478 (33.2%)]
Row points and column points are both displayed in principal coordinates: both have inertias along axes equal to the principal inertias.
Both sets of points occupy similar regions of the map: aesthetically a better graphic.
-
Doubled data: symmetric map
[Figure: symmetric map of the doubled data, showing the companies (A to I and ours) and both poles of each attribute (PQ, PQ-, In, In-, PR, PR-, En, En-, PL, PL-, MI, MI-, PS, PS-, GP, GP-). First axis: 0.1173 (54.5%); second axis: 0.0682 (31.7%). Annotated regions: "High product quality"; "High price sensitive; low environment, product range and price level"; "High product range, modern image, global products"]
Attributes have a positive and a negative pole; average association is at the origin of the map, e.g., In(novation) has a high average, P(roduct)Q(uality) has a low average.
Fairly similar configuration to the undoubled analysis: there is no strong halo effect.
-
Inertia contributions in CA
Correspondence analysis (CA) is a method of data visualization which represents the true positions of profile points in a map which comes closest to all the points, closest in the sense of weighted least-squares.
[Figure: a cloud of points and its map; first dimension 71%, second dimension 12%]
The inertia explained in the map applies to all the points: if we say 83% of the inertia is explained in the map, 71% on the first dimension and 12% on the second, this is a figure calculated for all row (or column) points together.
-
Inertia contributions in CA
This type of inertia-explained-by-axes calculation can be made for individual points.
These more detailed results are aids to interpretation in the form of numerical diagnostics, called contributions.
Especially when there is not a high percentage of inertia explained by the map, these contributions will help us to identify points which are represented inaccurately.
The inertias and their percentages tell us how much of the variance in the table is explained by the principal axes. The contributions do the same, but for each point individually, and help us to see:
(a) which points are being explained better than others;
(b) which points are contributing to the solution more than others.
-
Geometry of inertia contributions
[Figure: the i-th point $a_i$ with mass $m_i$ at distance $d_i$ from the centroid $c$, and its projection $f_{ik}$ on the k-th principal axis]
Total inertia of the cloud of points = $\sum_i m_i d_i^2 = \sum_i m_i \sum_k f_{ik}^2 = \sum_k \lambda_k$
Inertia of i-th point = $m_i d_i^2 = m_i \sum_k f_{ik}^2$
Inertia contribution of i-th point to k-th axis = $m_i f_{ik}^2$
-
Geometry of inertia contributions
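[Figure: the i-th point $a_i$ with mass $m_i$ at distance $d_i$ from the centroid $c$, and its projection $f_{ik}$ on the k-th principal axis]
Total inertia of the cloud of points = $\sum_i m_i d_i^2 = \sum_i m_i \sum_k f_{ik}^2 = \sum_k \lambda_k$
Inertia of i-th point = $m_i d_i^2 = m_i \sum_k f_{ik}^2$
Inertia contribution of i-th point to k-th axis = $m_i f_{ik}^2$
Table of inertia contributions (rows: points; columns: axes; right margin: row sums; bottom margin: column sums):
$$
\begin{array}{c|cccc|c}
 & \text{Axis } 1 & \text{Axis } 2 & \cdots & \text{Axis } p & \\
\hline
1 & m_1 f_{11}^2 & m_1 f_{12}^2 & \cdots & m_1 f_{1p}^2 & m_1 d_1^2 \\
2 & m_2 f_{21}^2 & m_2 f_{22}^2 & \cdots & m_2 f_{2p}^2 & m_2 d_2^2 \\
3 & m_3 f_{31}^2 & m_3 f_{32}^2 & \cdots & m_3 f_{3p}^2 & m_3 d_3^2 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
n & m_n f_{n1}^2 & m_n f_{n2}^2 & \cdots & m_n f_{np}^2 & m_n d_n^2 \\
\hline
 & \lambda_1 & \lambda_2 & \cdots & \lambda_p &
\end{array}
$$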
-
Inertia contributions
[Figure: the i-th point $a_i$ with mass $m_i$ at distance $d_i$ from the centroid $c$, its projection $f_{ik}$ on the k-th principal axis, and the angle $\theta_{ik}$ between the point and the axis]
In the table of inertia contributions $m_i f_{ik}^2$ (points $1, \ldots, n$ as rows, axes $1, \ldots, p$ as columns), the row sums are $m_i d_i^2$ and the column sums are $\lambda_k$.
$m_i f_{ik}^2 / \lambda_k$: amount of inertia of axis k explained by point i (absolute contribution, CTR)
$m_i f_{ik}^2 / m_i d_i^2$: amount of inertia of point i explained by axis k (relative contribution, COR)
$m_i f_{ik}^2 / m_i d_i^2 = f_{ik}^2 / d_i^2$, i.e. the square of $f_{ik} / d_i = \cos(\theta_{ik})$, where $\theta_{ik}$ is the angle between point i and axis k
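The two kinds of contributions can be sketched from a set of masses and principal coordinates. The numbers below are made up for illustration (three points, two axes, chosen so that the weighted centroid is at the origin); they are not results from the presentation, and the function name `contributions` is an assumption.

```python
# Hypothetical masses and principal coordinates f[i][k]; not from the slides.
MASSES = [0.5, 0.3, 0.2]
COORDS = [
    [0.2, 0.06],
    [-0.4, 0.02],
    [0.1, -0.18],
]


def contributions(masses, coords):
    """Return (principal inertias, CTR table, COR table)."""
    n_axes = len(coords[0])
    # Cell (i, k) holds m_i * f_ik^2, the inertia of point i along axis k.
    cells = [[m * f * f for f in row] for m, row in zip(masses, coords)]
    # Column sums are the principal inertias lambda_k.
    lambdas = [sum(row[k] for row in cells) for k in range(n_axes)]
    # Row sums are the point inertias m_i * d_i^2 (all axes are included here).
    point_inertias = [sum(row) for row in cells]
    ctr = [[row[k] / lambdas[k] for k in range(n_axes)] for row in cells]
    cor = [[v / total for v in row] for row, total in zip(cells, point_inertias)]
    return lambdas, ctr, cor


lambdas, ctr, cor = contributions(MASSES, COORDS)
print("principal inertias:", [round(v, 4) for v in lambdas])
print("CTR:", [[round(v, 3) for v in row] for row in ctr])
print("COR:", [[round(v, 3) for v in row] for row in cor])
```

Each CTR column sums to 1 over the points. Each COR row sums to 1 here only because both axes of this two-dimensional toy example are included; in a map that keeps fewer axes than the full space, the row sum measures how well the point is represented.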
-
Contributions to axes and contributions to points (product data, doubled)

Contributions (rows):
        Weight (relative)      F1       F2
A                   0.100   0.200    0.010
B                   0.100   0.006    0.266
C                   0.100   0.249    0.031
D                   0.100   0.153    0.011
E                   0.100   0.113    0.010
F                   0.100   0.113    0.004
G                   0.100   0.037    0.414
H                   0.100   0.074    0.202
I                   0.100   0.009    0.044
ours                0.100   0.048    0.010

Squared cosines (rows):
           F1       F2
A       0.922    0.027
B       0.033    0.914
C       0.901    0.065
D       0.856    0.035
E       0.827    0.045
F       0.929    0.017
G       0.129    0.839
H       0.320    0.510
I       0.087    0.259
ours    0.389    0.046
Annotation on the squared cosines: not so well-represented (the lowest two-axis totals are those of I and ours).

Eigenvalues and percentages of inertia:
                    F1        F2
Eigenvalue       0.117     0.068
Rows depend     54.482    31.656
Cumulative %    54.482    86.139
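One way to read the squared cosines is to add them over the two displayed axes, which gives each point's quality of representation in the map. The sketch below does this for the values in the table above; the cut-off of 0.5 used to flag poorly represented points is an arbitrary assumption, not a rule from the presentation.

```python
# Squared cosines (F1, F2) for the doubled product data, as tabulated above.
SQUARED_COSINES = {
    "A": (0.922, 0.027),
    "B": (0.033, 0.914),
    "C": (0.901, 0.065),
    "D": (0.856, 0.035),
    "E": (0.827, 0.045),
    "F": (0.929, 0.017),
    "G": (0.129, 0.839),
    "H": (0.320, 0.510),
    "I": (0.087, 0.259),
    "ours": (0.389, 0.046),
}

THRESHOLD = 0.5  # assumed cut-off, chosen only for illustration


def quality(cos2):
    """Share of a point's inertia explained by the displayed axes."""
    return sum(cos2)


qualities = {name: quality(c) for name, c in SQUARED_COSINES.items()}
poorly_represented = [name for name, q in qualities.items() if q < THRESHOLD]

for name, q in qualities.items():
    print(name, round(q, 3))
print("below threshold:", poorly_represented)
```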
-
CARME 2007
Correspondence Analysis & Related Methods
Erasmus University Rotterdam
25-27 June 2007
http://www.carme-n.org
After:
Correspondence Analysis in the Social Sciences (Cologne, 1991)
Visualizing Categorical Data (Cologne, 1995)
Large Scale Data Analysis (Cologne, 1999)
Correspondence Analysis and Related Methods (CARME 2003) (Barcelona, 2003)
-
Just published by Chapman & Hall / CRC Press