Cluster Analysis
First used by Tryon (1939), the term encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories.
Purpose of Cluster Analysis
Cluster analysis simplifies and classifies data: similar individuals (observations) are grouped together.
Objects are grouped according to certain attributes so that objects within the same cluster share the same characteristics (homogeneity), while different clusters differ from one another markedly.
Viewed geometrically, members of the same cluster should lie close together, while members of different clusters should be far apart from one another.
Applications of Cluster Analysis
Education, medicine, sociology, psychology, economics, biology.
Computation in Cluster Analysis
The computation is based on "distance" or "similarity" data for the observations. The smaller the distance between two observations, the more alike they are in some respect, and the larger their similarity measure.
Using the computed distance matrix or similarity-coefficient matrix, the N observations can be merged step by step according to some criterion, eventually forming a few representative clusters.
Distance measures: distances between points are used as the metric; the most commonly adopted is the Euclidean distance. If there are N observations, each with M attributes, let X be the N×M data matrix; the Euclidean distance between points i and j is

  d_ij = [ Σ_{p=1}^{M} (x_ip − x_jp)² ]^{1/2}

If the attributes are measured in different units, each variable should be standardized (mean 0, standard deviation 1) before computing Euclidean distances.
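The distance formula and the standardization step above can be sketched in a few lines (a minimal illustration; the function names and the sample matrix are my own, not from the slides):

```python
import math

def standardize_columns(X):
    """Rescale each attribute (column) to mean 0 and standard deviation 1."""
    cols = list(zip(*X))
    zcols = []
    for col in cols:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        zcols.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*zcols)]

def euclidean(xi, xj):
    """d_ij = sqrt( sum_p (x_ip - x_jp)^2 )"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

X = [[12, 8], [-8, 4], [4, -2], [-8, -6]]  # N=4 observations, M=2 attributes
Z = standardize_columns(X)                 # each column now has mean 0, sd 1
print(euclidean(X[0], X[1]))               # distance on the raw scale
print(euclidean(Z[0], Z[1]))               # distance after standardization
```

Standardizing first keeps an attribute with large units (e.g. income in dollars) from dominating the distance.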
Similarity measures:
The greater the similarity, the smaller the dissimilarity between two observations; when operating on a similarity matrix, the clusters with the largest similarity values are merged first.
The similarity between two observations can be measured with the matching coefficient:

  M_ij = (a + b) / m

where a is the number of attributes that observations i and j both possess, b is the number of attributes that neither i nor j possesses, and m is the total number of attributes.
Ex: suppose i and j have the following attributes (1 = possesses the attribute, 0 = does not):

        A  B  C  D  E  F
  i     1  0  0  1  1  1
  j     0  0  1  1  0  0
The matching coefficient is then:

  M_ij = (a + b) / m = (1 + 1) / 6 = 2/6 = 1/3
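The matching-coefficient definition above translates directly into code (a sketch; the function name is my own):

```python
def matching_coefficient(i_attrs, j_attrs):
    """M_ij = (a + b) / m, where a = attributes both observations possess,
    b = attributes neither possesses, m = total number of attributes."""
    a = sum(1 for x, y in zip(i_attrs, j_attrs) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i_attrs, j_attrs) if x == 0 and y == 0)
    return (a + b) / len(i_attrs)

# The slide's example: attributes A-F for observations i and j
i = [1, 0, 0, 1, 1, 1]
j = [0, 0, 1, 1, 0, 0]
print(matching_coefficient(i, j))  # (1 + 1) / 6 = 1/3
```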
Methods of Cluster Analysis: Non-hierarchical Clustering
Non-hierarchical (non-hierarchical) cluster analysis operates directly on the distance or similarity matrix. Common variants:
a. Sequential threshold: a cluster seed is selected in advance and a threshold value is set; all observations within the threshold distance of that center form one cluster. A new cluster seed is then selected, observations not yet assigned to a cluster are grouped into the second cluster, and the process continues in this way.
b. Parallel threshold: several cluster seeds are selected simultaneously at the outset and threshold values are set; each observation is then assigned to the nearest cluster center, forming the clusters. The thresholds can also be adjusted to admit more (or fewer) observations into the clusters.
c. Optimizing partitioning: based on some criterion (e.g. minimizing the average within-cluster distance), different partitions are tried repeatedly until the criterion measure reaches its optimum.
d. K-means method: an integrated application of the methods above. The observations are partitioned into K clusters; the distance from each observation to each cluster centroid is computed, and each observation is assigned to the nearest cluster. The centroids of the cluster that gained the observation and the cluster that lost it are then recomputed, and the distances from each observation to the centroids are evaluated again. This is repeated until no cluster has observations that need to be reassigned.
Hierarchical Cluster Analysis
Its defining characteristic is that each new cluster is formed by merging or splitting clusters from the previous level, so the analysis produces a tree structure. Among hierarchical divisive methods, a common one is average-distance splitting; its steps are:
First find the observation whose average distance to all other observations is largest; call it the splinter group, and call the remaining observations the main group. Then compute the distances between the splinter group and the main group, and among the observations within the main group.
If an observation in the main group is farther from the other main-group observations than from the splinter group, move it to the splinter group; otherwise it stays in the main group.
The K-means computation, step by step:
1. Partition the observations into K initial clusters.
2. Compute the distance from an observation to each cluster center (its mean), usually the Euclidean distance, and assign the observation to the nearest cluster. Then recompute the centers of the cluster that gained the observation and the cluster that lost it.
3. Repeat step 2 until no observation needs to be reassigned to another cluster.
Ex: four observations measured on two variables:

  Observation    1     2     3     4
  X1            12    −8     4    −8
  X2             8     4    −2    −6

First split the four observations arbitrarily into two clusters, say cluster {1, 2} and cluster {3, 4}, then compute the centroid of each:

  Cluster {1, 2}:  X1 = (12 + (−8))/2 = 2    X2 = (8 + 4)/2 = 6
  Cluster {3, 4}:  X1 = (4 + (−8))/2 = −2    X2 = (−2 + (−6))/2 = −4

Next compute the squared Euclidean distance from each observation to each centroid and assign it to the nearest cluster. For example, D²(1, {1, 2}) = (12 − 2)² + (8 − 6)² = 104.
These calculations show that observation 4 is closer to cluster {3, 4}, so it need not be reassigned, while observation 2 is also closer to cluster {3, 4} and is therefore moved there, giving the two new clusters {1} and {2, 3, 4} with centroids:

  Cluster {1}:        X1 = 12                        X2 = 8
  Cluster {2, 3, 4}:  X1 = (−8 + 4 + (−8))/3 = −4    X2 = (4 + (−2) + (−6))/3 ≈ −1.33

Continuing, the squared Euclidean distances from each observation to cluster {1} and cluster {2, 3, 4} are:

  Cluster        1        2       3       4
  {1}            0      416     164     596
  {2, 3, 4}   343.05   44.41   64.45   37.81

The table shows that observation 1 is closest to cluster {1}, and observations 2, 3, 4 are closest to cluster {2, 3, 4}, so no further reassignment is needed; the procedure ends with K = 2 clusters, namely {1} and {2, 3, 4}.
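The reassignment loop in this worked example can be sketched as a small K-means routine (a minimal sketch; function names are my own, and clusters are represented as lists of observation indices):

```python
def centroid(points):
    """Mean of each coordinate over the points in a cluster."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, clusters):
    """Reassign each point to the nearest centroid until nothing moves."""
    while True:
        cents = [centroid([points[i] for i in c]) for c in clusters]
        new = [[] for _ in clusters]
        for i, p in enumerate(points):
            k = min(range(len(cents)), key=lambda c: sq_dist(p, cents[c]))
            new[k].append(i)
        if new == clusters:
            return clusters, cents
        clusters = new

# The slide's data, with the arbitrary initial split {1,2} / {3,4}
pts = [[12, 8], [-8, 4], [4, -2], [-8, -6]]
clusters, cents = kmeans(pts, [[0, 1], [2, 3]])
print(clusters)  # [[0], [1, 2, 3]] -- matches the slide's final clusters
```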
Two-stage Cluster Sampling When Clusters Are of Unequal Size
Desired sample proportion: p = n/N
a: desired number of clusters selected in the 1st stage
A: total number of clusters
b: sample size within each cluster selected
Ni: number of elements in cluster i
Simple Two-stage Cluster Sampling
First-stage probability: p1 = a/A
Second-stage probability: p2 = p/(a/A), so that p1 · p2 = p
Sample size in cluster i: ni = p2 · Ni
Probability Proportional to Size
Each first-stage cluster is selected with probability proportional to its size,

  p_i = a · N_i / Σ_{i=1}^{A} N_i,

and b elements are drawn within each selected cluster, so n = a · b and every element has the same overall selection probability p, where

  b = p · Σ_{i=1}^{A} N_i / a
Example
Draw a sample of 1,000 households from a city that contains about 200,000 households distributed among 2,000 blocks of unequal but known size.
The desired sample proportion = 1/200; the desired number of clusters selected in the 1st stage = 100.
How do we conduct the two-stage cluster sampling?
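Working the example through with the definitions above (a sketch; the block size Ni = 120 is my own illustrative value, not from the slides):

```python
# Figures from the example above.
N = 200_000   # households in the city
n = 1_000     # desired sample size
A = 2_000     # total number of blocks (clusters)
a = 100       # blocks drawn in the first stage

p = n / N     # overall sampling fraction, 1/200
p1 = a / A    # first-stage selection probability, 1/20
p2 = p / p1   # second-stage probability, chosen so that p1 * p2 = p
print(p, p1, p2)  # approximately 0.005, 0.05, 0.1

# Within a selected block of Ni households, sample ni = p2 * Ni of them.
Ni = 120      # hypothetical block with 120 households
ni = p2 * Ni
print(ni)     # about 12 households from this block
```

So one tenth of the households in each of the 100 selected blocks are sampled, which yields the desired 1,000 households in expectation.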
What is Cluster Analysis?
Cluster Analysis is a class of statistical techniques that can be applied to data that exhibit natural groupings.
CA is an interdependence technique that makes no distinction between dependent and independent variables.
There is NO statistical significance testing in CA.
CA is more a group of different algorithms that put objects into clusters following "well-defined similarity rules."
What is A Cluster?
A cluster is a group of relatively homogeneous cases and observations.
Clusters exhibit high internal homogeneity and high external heterogeneity.
A Cluster Diagram: Drinkers' Perceptions of Alcohol [figure omitted]
Characteristics of CA
Cluster Analysis is a tool of discovery.
It discovers structures in data but does NOT explain why they exist.
CA is used when we do not have an a priori hypothesis, but when we are in the exploratory phase.
How Does CA Differ…
From Discriminant Analysis:
A dependence technique.
Predicts the probability that an object will fall into one of two or more mutually exclusive categories based on several independent variables.
Finds a linear combination of independent variables.
Cluster analysis, in contrast, finds natural groupings based on distances among objects.
From Factor Analysis:
Similar to cluster analysis in that it is an interdependence technique.
The primary difference lies in the focus on objects versus variables: factor analysis reduces variables to a few factors, whereas cluster analysis reduces objects to a few clusters.
Cluster Analysis Methods
Three cluster analysis methods:
Joining (Tree Clustering)
Two-way Joining
K-means Clustering
Joining (Tree Clustering)
A type of hierarchical clustering: agglomerative.
Initially, each unit is its own cluster.
Results are displayed in a dendrogram.
Many other methods exist.
The first level shows all samples x_i as singleton clusters. As the level increases, more samples are joined together in a hierarchical manner.
It is based on sets: each cluster level may contain sets that are subclusters, as shown in the Venn diagram.
Two-way Joining (Hartigan, 1975)
Two-way joining tries to cluster both variables and objects simultaneously.
It is only useful if you think clustering along BOTH lines will be informative.
Very rare in application.
k-Means Clustering:
Begin with a preconception about the number of clusters (k).
Can be thought of as ANOVA in reverse: ANOVA evaluates between-group variance against within-group variance when computing the statistical significance of the hypothesis that the groups are different.
In k-means, the computer tries to move objects in and out of the groups to get the most significant ANOVA results.
It's All About Distance…
Distance measures:
Euclidean Distance
Squared Euclidean Distance
Manhattan Distance
Chebychev Distance
Power Distance
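The five distance measures listed above can be compared side by side (a minimal sketch; the function names and the two sample points are my own):

```python
def euclidean(x, y):
    """Straight-line distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def squared_euclidean(x, y):
    """Euclidean distance without the square root; weights large gaps more."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Largest single coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def power(x, y, p, r):
    """Power distance: (sum |a-b|^p) ** (1/r); p = r = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / r)

x, y = (1, 2, 3), (4, 6, 3)
print(euclidean(x, y))          # 5.0
print(squared_euclidean(x, y))  # 25
print(manhattan(x, y))          # 7
print(chebychev(x, y))          # 4
```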
EQUATION: Euclidean Distance
The basic equation for determining the distance measure:

  Distance(x, y) = { Σ_i (x_i − y_i)² }^{1/2}

A standard formula for determining the distance between two points on a plane.
Fairly simple, right?
In other words, how do we get from this… to this… to this… [figures omitted]
How to Determine Clusters
Use a computer.
Call a professional.
Clusters in the Real World [figure omitted]
Why is Cluster Analysis Important?
A relatively new, evolving technique.
Highly useful for market segmentation.
Segmentation = identifying groupings of customers using multivariate statistical analysis, often based on perceptions and attitudes as well as demographics and behavior.
Segmentation is helpful to small companies attempting to carve out a niche, and to large companies trying to tailor their products/services to different segments.
In addition to segmentation, clusters are used to…
Design products and establish brands.
Target direct mail.
Make decisions about customer conversion and retention.
Decide on marketing cost levels.
Ex: Luxury Car Customers
Demographic examples are easier to illustrate.
Demographics: gender, education, age.
149 customers (objects) of a luxury car dealership.
Using SPSS for Clustering
Chose "TwoStep Cluster Analysis" — essentially the agglomerative (dendrogram) technique.
Step One: creates very small (individual) sub-clusters.
Step Two: clusters the sub-clusters into the desired number of clusters.
Automatically finds the optimum number of clusters.
Two-Step CA Output
What are these clusters?

Cluster Distribution
  Cluster     N     % of Combined   % of Total
  1          43     28.9%           28.9%
  2          27     18.1%           18.1%
  3          29     19.5%           19.5%
  4          21     14.1%           14.1%
  5          29     19.5%           19.5%
  Combined  149     100.0%          100.0%
  Total     149                     100.0%
Two-Step CA Output: Age
(columns are age categories 2–6; each cell shows Frequency and column Percent)

  Cluster      Age 2         Age 3         Age 4         Age 5         Age 6
  1             0    .0%     16  34.8%     18  35.3%      7  33.3%     2  20.0%
  2            16  76.2%      0    .0%     10  19.6%      0    .0%     1  10.0%
  3             3  14.3%      4   8.7%     15  29.4%      1   4.8%     6  60.0%
  4             2   9.5%     10  21.7%      7  13.7%      1   4.8%     1  10.0%
  5             0    .0%     16  34.8%      1   2.0%     12  57.1%     0    .0%
  Combined     21 100.0%     46 100.0%     51 100.0%     21 100.0%    10 100.0%
Two-Step CA Output: Education
(columns are education categories 1–5; each cell shows Frequency and column Percent)

  Cluster      Educ 1        Educ 2        Educ 3        Educ 4        Educ 5
  1             0    .0%      5  41.7%      0    .0%     38  60.3%     0    .0%
  2             2 100.0%      0    .0%      0    .0%     19  30.2%     6  17.6%
  3             0    .0%      0    .0%     29  76.3%      0    .0%     0    .0%
  4             0    .0%      0    .0%      0    .0%      0    .0%    21  61.8%
  5             0    .0%      7  58.3%      9  23.7%      6   9.5%     7  20.6%
  Combined      2 100.0%     12 100.0%     38 100.0%     63 100.0%    34 100.0%
Two-Step CA Output: Gender
(columns are gender categories 0 and 1; each cell shows Frequency and column Percent)

  Cluster      Gender 0      Gender 1
  1             0    .0%     43  48.9%
  2            19  31.1%      8   9.1%
  3            13  21.3%     16  18.2%
  4             0    .0%     21  23.9%
  5            29  47.5%      0    .0%
  Combined     61 100.0%     88 100.0%
What does this mean?
Cluster 5:
  Age: 36–65
  Education: high school graduate or above
  Gender: female
Could have used k-means; it would have generated different results.
Clustering is a powerful marketing research tool.
Claritas: Clustering Experts
Example: Claritas Corporation. Claritas founded the U.S. geodemographic industry when it launched the first PRIZM segmentation system in 1974.
PRIZM (Potential Rating Index for Zip Markets) categorizes every U.S. neighborhood into 1 of 62 "clusters."
Descriptive names: Money and Brains; Young Literati; Shotguns and Pickups.
Money and Brains: Sophisticated Urban Fringe Couples
The cluster is a mix of family types: singles, married couples with children, and married couples without children. These families own their homes in upscale neighborhoods near cities. Dual incomes provide luxuries, travel, and entertainment.
Demographics: affluent; age groups 55–64 and 65+; predominantly White, high Asian.
Clusters Work!
At a conservative estimate, more than 20,000 companies in the United States and Canada alone used clusters as part of their marketing information mix last year.
Web Sources
http://cwis.livjm.ac.uk/bus/busrmccl/ae230/lect10.ppt
http://www.clusterbigip1.claritas.com/claritas/Default.jsp?main=3&submenu=seg&subcat=segprizm
http://www.clusterbigip1.claritas.com/claritas/Default.jsp?main=3&submenu=seg&subcat=segprizmne
http://www.insightsc.ie/newsletter7.htm
http://www.directionsmag.com/article.asp?article_id=12
http://fun.supereva.it/scoleri.freeweb/cern/biografie/hawking.jpg
http://www.statsoft.com/textbook/stcluan.html
http://www-db.stanford.edu/~ullman/mining/cluster1.pdf
http://www.snr.missouri.edu/multivariate/ClusterAnalysis.pdf
Print Sources
Recent Developments in Clustering and Data Analysis. Edited by Chikio Hayashi, Edwin Diday, Michel Jambou, Noboru Ohsumi. Academic Press, Inc. 1988.
Finding Groups in Data: An Introduction to Cluster Analysis. Leonard Kaufman, Peter J. Rousseeuw. John Wiley and Sons, Inc. 1990.
Marketing Research: An Aid to Decision Making. Dr. Alan T. Shao. South-Western. 2002.
Exploring Marketing Research. William G. Zikmund. South-Western. 2003.
Ex 7: Hypothetical Data

  Subject Id.   Income ($1000)   Education (years)
  S1             5                 5
  S2             6                 6
  S3            15                14
  S4            16                15
  S5            25                20
  S6            30                19
Similarity Matrix (Squared Euclidean Distances)

  Id     S1    S2    S3    S4    S5    S6
  S1      0     2   181   221   625   821
  S2      2     0   145   181   557   745
  S3    181   145     0     2   136   250
  S4    221   181     2     0   106   212
  S5    625   557   136   106     0    26
  S6    821   745   250   212    26     0

  d(S1, S3) = (15 − 5)² + (14 − 5)² = 181
  d(S1, S2) = 2 is the smallest distance (highest similarity), so S1 and S2 are merged.
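The whole similarity matrix above can be reproduced from the raw data (a sketch; the variable names are my own):

```python
data = {                 # (Income in $1000, Education in years)
    "S1": (5, 5),   "S2": (6, 6),   "S3": (15, 14),
    "S4": (16, 15), "S5": (25, 20), "S6": (30, 19),
}

def sq_dist(p, q):
    """Squared Euclidean distance between two subjects."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

matrix = {(i, j): sq_dist(data[i], data[j]) for i in data for j in data}
print(matrix[("S1", "S3")])   # 181
print(matrix[("S5", "S6")])   # 26

# The smallest off-diagonal entry identifies the first merge:
pairs = [(i, j) for i in data for j in data if i < j]
print(min(pairs, key=lambda p: matrix[p]))   # ('S1', 'S2')
```

(Here d(S1, S2) and d(S3, S4) tie at 2; `min` simply returns the first of the tied pairs.)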
Centroid Method: Five Clusters

  Cluster   Members                Income ($1000)     Education (years)
  1         S1 & S2 (5,5) (6,6)    5.5 = (5+6)/2      5.5 = (5+6)/2
  2         S3                     15                 14
  3         S4                     16                 15
  4         S5                     25                 20
  5         S6                     30                 19
Similarity Matrix (Squared Euclidean Distances)

  Id       S1&S2   S3      S4      S5      S6
  S1&S2    0       162.5   200.5   590.5   782.5
  S3       162.5   0       2       136     250
  S4       200.5   2       0       106     212
  S5       590.5   136     106     0       26
  S6       782.5   250     212     26      0

  d(S1&S2, S3) = (5.5 − 15)² + (5.5 − 14)² = 162.5
  d(S3, S4) = 2 is the smallest distance (highest similarity), so S3 and S4 are merged.
Centroid Method: Four Clusters

  Cluster   Members                   Income ($1000)      Education (years)
  1         S1 & S2 (5,5) (6,6)       5.5 = (5+6)/2       5.5 = (5+6)/2
  2         S3 & S4 (15,14) (16,15)   15.5 = (15+16)/2    14.5 = (14+15)/2
  3         S5                        25                  20
  4         S6                        30                  19
Similarity Matrix (Squared Euclidean Distances)

  Id       S1&S2   S3&S4   S5      S6
  S1&S2    0       181     590.5   782.5
  S3&S4    181     0       120.5   230.5
  S5       590.5   120.5   0       26
  S6       782.5   230.5   26      0

  d(S1&S2, S5) = (5.5 − 25)² + (5.5 − 20)² = 590.5
  d(S5, S6) = 26 is the smallest distance (highest similarity), so S5 and S6 are merged.
Centroid Method: Three Clusters

  Cluster   Members                   Income ($1000)      Education (years)
  1         S1 & S2 (5,5) (6,6)       5.5 = (5+6)/2       5.5 = (5+6)/2
  2         S3 & S4 (15,14) (16,15)   15.5 = (15+16)/2    14.5 = (14+15)/2
  3         S5 & S6 (25,20) (30,19)   27.5 = (25+30)/2    19.5 = (20+19)/2
Similarity Matrix (Squared Euclidean Distances)

  Id       S1&S2   S3&S4   S5&S6
  S1&S2    0       181     680
  S3&S4    181     0       169
  S5&S6    680     169     0

  d(S1&S2, S5&S6) = (5.5 − 27.5)² + (5.5 − 19.5)² = 680
  d(S3&S4, S5&S6) = 169 is now the smallest distance (highest similarity), so these two clusters are merged.
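The centroid-method merge sequence above (S1+S2, then S3+S4, then S5+S6, then {S3,S4}+{S5,S6}) can be sketched as a short agglomerative routine (a sketch under the slides' definitions; names are my own, clusters are lists of point indices, and ties are broken by taking the first closest pair):

```python
def centroid(points):
    """Mean of each coordinate over the points in a cluster."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(p, q):
    """Squared Euclidean distance between two centroids."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid_agglomerate(points, k):
    """Repeatedly merge the two clusters with the closest centroids until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        cents = [centroid([points[i] for i in c]) for c in clusters]
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: sq_dist(cents[ij[0]], cents[ij[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

pts = [(5, 5), (6, 6), (15, 14), (16, 15), (25, 20), (30, 19)]
print(centroid_agglomerate(pts, 3))  # [[0, 1], [2, 3], [4, 5]]
```

Stopping at k = 3 reproduces the three clusters {S1,S2}, {S3,S4}, {S5,S6} reached in the tables above.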
Exhibit 7-1: SAS Output for Cluster Analysis on the Data in Table 7.1

Simple Statistics
            Mean      Std Dev   Skewness   Kurtosis   Bimodality
  INCOME    16.1667   9.9883    0.2684     -1.4015    0.2211
  EDUC      13.1667   6.3692    -0.4510    -1.8108    0.2711

Root-Mean-Square Total-Sample Standard Deviation = 8.376555

The number of clusters cannot be determined from these statistics alone.
  Step   Number of   Clusters   Frequency of   RMS STD of    Semipartial               Centroid
         Clusters    Joined     New Cluster    New Cluster   R-Squared     R-Squared   Distance
  1      5           S1 S2      2              0.707107      0.001425      0.998575    1.4142
  2      4           S3 S4      2              0.707107      0.001425      0.997150    1.4142
  3      3           S5 S6      2              2.549510      0.018527      0.978622    5.0990
  4      2           CL4 CL3    4              5.522681      0.240855      0.737767    13.0000
  5      1           CL5 CL2    6              8.376555      0.737767      0.000000    19.7041

Root-Mean-Square Total-Sample Standard Deviation = 8.376555 (RMSSTD)
The smaller the RMSSTD, the more similar the individuals within a cluster (it can only be used to check similarity).
The three-cluster solution is a good cutting point, because R² drops sharply after it.

  CLUSTER=1                CLUSTER=2                CLUSTER=3
  OBS  SID  INCOME  EDUC   OBS  SID  INCOME  EDUC   OBS  SID  INCOME  EDUC
  1    S1   5       5      3    S3   15      14     5    S5   25      20
  2    S2   6       6      4    S4   16      15     6    S6   30      19
Exhibit 7-2: Non-hierarchical Clustering on the Data
Replace=FULL Radius=0 Maxclusters=3 Maxiter=20 Converge=0.02

Initial Seeds
  Cluster   INCOME    EDUC
  1         5.0000    5.0000
  2         30.0000   19.0000
  3         16.0000   15.0000

The initially selected seed observations are S1, S6, and S4.
Exhibit 7-2 (continued)
Minimum Distance Between Seeds = 14.56022

  Iteration   Change in Cluster Seeds
              1          2         3
  1           0.707107   2.54951   0.707107
  2           0          0         0

Statistics for Variables
  Variable   Total STD   Within STD   R-Squared   RSQ/(1-RSQ)
  INCOME     9.988327    2.121320     0.972937    35.950617
  EDUC       6.369197    0.707107     0.992605    134.222222
  OVER-ALL   8.376555    1.581139     0.978622    45.777778

Remember to compare the within-cluster RMSSTD with the Total STD, e.g. for EDUC: 0.7071/6.3691. This ratio is smaller for EDUC, so EDUC has the higher within-cluster homogeneity.
The overall R² is close to 1, indicating that the clusters are highly heterogeneous on INCOME and EDUC. (R² = SSB/SST.)
Exhibit 7-2 (continued)
Pseudo F Statistic = 68.67
Approximate Expected Over-All R-Squared = .
Cubic Clustering Criterion = .
WARNING: The two above values are invalid for correlated variables.

Cluster Means
  Cluster   INCOME    EDUC
  1         5.5000    5.5000
  2         27.5000   19.5000
  3         15.5000   14.5000

The cluster means offer another way to characterize the groups; they can be interpreted as:
  Cluster 1: low income, low education
  Cluster 2: high income, high education
  Cluster 3: middle income, middle education
Exhibit 7-4: Hierarchical Cluster Analysis for Food Data

SINGLE LINKAGE CLUSTER ANALYSIS
Simple Statistics
             MEAN      STD DEV   SKEWNESS   KURTOSIS   BIMODALITY
  CALORIES   207.407   101.208   0.542      -0.675     0.478
  PROTEIN    19.000    4.252     -0.824     1.327      0.357
  FAT        13.481    11.257    0.790      -0.624     0.589
  CALCIUM    43.963    78.034    3.159      11.345     0.746
  IRON       2.381     1.461     1.230      1.469      0.518
Exhibit 7-4 (continued)

COMPLETE LINKAGE CLUSTER ANALYSIS
  Number of   Clusters                 Frequency of   RMS STD of    Semipartial               Maximum
  Clusters    Joined                   New Cluster    New Cluster   R-Squared     R-Squared   Distance
  10          CL15  CANNED CRABMEAT    4              11.32324      0.003476      0.985594    50.6665
  9           CL17  ROAST LAMB SHOUL   3              12.59929      0.003226      0.982367    55.6611
  8           CL14  CANNED SHRIMP      3              16.10565      0.005231      0.977136    71.1677
  7           CL13  ROAST BEEF         6              14.34190      0.009755      0.967381    80.9343
  6           CL10  CL8                7              22.14096      0.023782      0.943599    108.1758
  5           CL9   CL11               11             20.22234      0.039103      0.904496    141.7814
  4           CL6   CL12               9              30.07489      0.048662      0.855835    154.4447
  3           CL7   CL5                17             38.73570      0.220433      0.635402    262.5666
  2           CL4   CANNED SARDINES    10             51.36181      0.192623      0.442779    364.8934
  1           CL3   CL2                27             57.40958      0.442779      0.000000    433.7617
(Complete linkage method)
Exhibit 7-4 (continued)

ROOT-MEAN-SQUARE TOTAL-SAMPLE STANDARD DEVIATION = 57.4096
  Number of   Clusters                           Frequency of   RMS STD of    Semipartial               Minimum
  Clusters    Joined                             New Cluster    New Cluster   R-Squared     R-Squared   Distance
  10          CANNED MACKEREL  CANNED SALMON     2              11.16786      0.001455      0.973438    35.3159
  9           CL14   ROAST LAMB SHOULDER        3              12.59929      0.003226      0.970211    35.4131
  8           CL11   CANNED CRABMEAT            12             16.80697      0.014701      0.955510    39.5267
  7           CL15   CL9                        8              20.48901      0.028341      0.927169    40.1627
  6           CL7    CL8                        20             40.04817      0.285060      0.642109    40.2746
  5           CL12   CANNED SHRIMP              3              16.10565      0.005231      0.636878    44.8504
  4           CL6    ROAST BEEF                 21             43.49500      0.085924      0.550954    45.7642
  3           CL4    CL5                        24             48.72189      0.189548      0.361406    48.7139
  2           CL3    CL10                       26             50.53988      0.106595      0.254811    62.2624
  1           CL2    CANNED SARDINES            27             57.40958      0.254811      0.000000    211.5691

The smaller the RMS STD, the higher the within-cluster homogeneity.
The larger the R², the greater the between-cluster dissimilarity; solutions with a very low R² can be discarded.
Exhibit 7-4 (continued)

CENTROID HIERARCHICAL CLUSTER ANALYSIS
  Number of   Clusters                     Frequency of   RMS STD of    Semipartial               Centroid
  Clusters    Joined                       New Cluster    New Cluster   R-Squared     R-Squared   Distance
  10          CL15  CANNED CRABMEAT        4              11.32324      0.003476      0.985594    44.5633
  9           CL16  ROAST LAMB SHOULDER    3              12.59929      0.003226      0.982367    45.5370
  8           CL14  CANNED SHRIMP          3              16.10565      0.005231      0.977136    57.9815
  7           CL13  CL10                   12             16.80697      0.026857      0.950279    65.6901
  6           CL12  ROAST BEEF             6              14.34190      0.009755      0.940524    70.8222
  5           CL6   CL9                    9              24.36751      0.039727      0.900797    92.2533
  4           CL8   CL11                   5              26.85628      0.026158      0.874639    96.6423
  3           CL7   CL4                    17             31.36108      0.113709      0.760930    117.4906
  2           CL5   CL3                    26             50.53988      0.506119      0.254811    191.9655
  1           CL2   CANNED SARDINES        27             57.40958      0.254811      0.000000    336.7134

Semipartial R² records the loss of within-cluster similarity at each merge; a smaller value indicates the merged cluster remains more homogeneous.
(Centroid method)
Exhibit 7-4 (continued)

WARD'S MINIMUM VARIANCE CLUSTER ANALYSIS
  Number of   Clusters                   Frequency of   RMS STD of    Semipartial               Between-Cluster
  Clusters    Joined                     New Cluster    New Cluster   R-Squared     R-Squared   Sum of Squares
  10          CL14  CANNED CRABMEAT      4              11.32324      0.003476      0.985908    1489.42
  9           CL16  CL20                 8              7.75641       0.003541      0.982367    1517.12
  8           CL15  CANNED SHRIMP        3              16.10565      0.005231      0.977136    2241.24
  7           CL12  ROAST BEEF           6              14.34190      0.009755      0.967381    4179.83
  6           CL10  CL8                  7              22.14096      0.023782      0.943599    10189.5
  5           CL11  CL9                  11             20.22234      0.039103      0.904496    16754.1
  4           CL6   CL13                 9              30.07489      0.048662      0.855835    20849.7
  3           CL5   CL4                  20             36.22080      0.158726      0.697109    68007.8
  2           CL3   CANNED SARDINES      21             47.72546      0.240715      0.456394    103137
  1           CL7   CL2                  27             57.40958      0.456394      0.000000    195548

(Ward's method)
Exhibit 7-5: Non-hierarchical Analysis for Food-Nutrient Data

INITIAL SEEDS (based on the centroid method)
  CLUSTER   CALORIES   PROTEIN   FAT      CALCIUM   IRON
  1         331.111    19.000    27.556   8.778     2.467
  2         161.667    20.500    7.500    14.250    1.925
  3         100.000    14.800    3.400    114.000   3.000
Exhibit 7-5 (continued)

MINIMUM DISTANCE BETWEEN SEEDS = 117.4876
  Iteration   Change in Cluster Seeds
              1         2         3
  1           10.8475   6.46446   0.3
  2           0         6.85281   12.7855
  3           0         0         0
CLUSTER SUMMARY
                           RMS STD     Maximum Distance from   Nearest   Centroid
  Cluster     Frequency    Deviation   Seed to Observation     Cluster   Distance
  1           8            20.8936     78.8882                 2         168.5
  2           12           16.3651     70.9576                 3         117.9
  3           6            27.8059     79.6672                 2         117.9

Frequency gives the number of individuals in each cluster; cluster 2 is the most homogeneous; the last two columns show each cluster's nearest cluster and the distance between the two cluster centroids.
The smaller the RMS STD, the higher the within-cluster similarity.
When the measurement scales (instruments) are the same, RMSSTD values can be compared directly; when they differ, the ratio Within STD / Total STD must be examined instead.

STATISTICS FOR VARIABLES
  Variable   Total STD   Within STD   R-Squared   RSQ/(1-RSQ)
  CALORIES   103.06085   39.89286     0.86216     6.25453
  PROTEIN    4.29257     3.58590      0.35798     0.55758
  FAT        11.44357    4.52989      0.85584     5.93681
  CALCIUM    44.70188    22.76009     0.76150     3.19291
  IRON       1.49005     1.51663      0.04688     0.04919
  OVER-ALL   50.53988    20.71299     0.84547     5.47135

PSEUDO F STATISTIC = 62.92
APPROXIMATE EXPECTED OVER-ALL R-SQUARED = 0.78678
CUBIC CLUSTERING CRITERION = 2.186

The R² values for PROTEIN and IRON are very small, meaning these two variables do not differentiate the clusters; the grouping is poor on them, so one could remove these two variables and cluster again.
The smaller the ratio Within STD / Total STD, the higher the within-cluster homogeneity.
Exhibit 7-5 (continued)

CLUSTER MEANS
  Cluster   CALORIES   PROTEIN   FAT      CALCIUM   IRON
  1         341.875    18.750    28.875   8.750     2.437
  2         174.583    21.083    8.750    11.833    2.083
  3         98.333     14.667    3.167    101.333   2.883

Cluster 1 can be labeled by its calories, cluster 2 by its fat index, and cluster 3 by its calcium.
Exhibit 7-5 (continued)

CLUSTER=1
  OBS   NAME             CLUSTER   DISTANCE   CALORIES   PROTEIN   FAT   CALCIUM   IRON
  1     BRAISED BEEF     1         2.4357     340        20        28    9         2.6
  2     ROAST BEEF       1         78.8882    420        15        39    7         2.0
  3     BEEF STEAK       1         33.2744    375        19        32    9         2.6
  4     ROAST LAMB LEG   1         77.3963    265        20        20    9         2.6
  5     ROAST LAMB       1         42.0616    300        18        25    9         2.3
  6     SMOKED HAM       1         2.4311     340        20        28    9         2.5
  7     PORK ROAST       1         1.9132     340        19        29    9         2.5
  8     PORK SIMMERED    1         13.1779    355        19        30    9         2.4

Named the high-fat food group (high calories, high fat index, low calcium). (8 cases)
Exhibit 7-5 (continued)

CLUSTER=2
  OBS   NAME               CLUSTER   DISTANCE   CALORIES   PROTEIN   FAT   CALCIUM   IRON
  9     HAMBURGER          2         70.9576    245        21        17    9         2.7
  10    CANNED BEEF        2         7.8135     180        22        10    17        3.7
  11    BROILED CHICKEN    2         59.9964    115        20        3     8         1.4
  12    CANNED CHICKEN     2         6.3070     170        25        7     12        1.5
  13    BEEF HEART         2         16.4369    160        26        5     14        5.9
  14    BEEF TONGUE        2         31.3971    205        18        14    7         2.5
  15    VEAL CUTLET        2         10.9841    185        23        9     9         2.7
  16    BAKED BLUEFISH     2         42.0215    135        22        4     25        0.6
  17    FRIED HADDOCK      2         40.2403    135        16        5     15        0.5
  18    BROILED MACKEREL   2         26.7634    200        19        13    5         1.0
  19    FRIED PERCH        2         21.2850    195        16        11    14        1.3
  20    CANNED TUNA        2         7.9719     170        25        7     7         1.2

Named the moderate-fat food group (low calories, low fat index, low calcium). (12 cases)
Exhibit 7-5 (continued)

CLUSTER=3
  OBS   NAME              CLUSTER   DISTANCE   CALORIES   PROTEIN   FAT   CALCIUM   IRON
  21    RAW CLAMS         3         34.7046    70         11        1     82        6.0
  22    CANNED CLAMS      3         60.5092    45         7         1     74        5.4
  23    CANNED CRABMEAT   3         63.9273    90         14        2     38        0.8
  24    CANNED MACKEREL   3         79.6672    155        16        9     157       1.8
  25    CANNED SALMON     3         61.7127    120        17        5     159       0.7
  26    CANNED SHRIMP     3         14.8809    110        23        1     98        2.6

Named the low-calorie, high-calcium food group. (6 cases)