Post on 30-Mar-2015
Symbolic and statistical Analyses of meta-data
using the “Semana” platform —
a bundle of tools for the KDD research
Georges Sauvet (CNRS, Toulouse)
Centre de Recherche et d’Etude de l’Art Préhistorique
UMR 5608: Travaux et Recherches Archéologiques sur les Cultures, les Espaces et les Sociétés
CASK Sorbonne 2008, Paris, June 13th
SEMANA and Data Mining
sampling
Data coding
KDD techniques (Rough Set, FCA, statistical analysis, etc.)
interpretation
Datawarehouse
After B. Wüthrich, 1998
“SEMANA”, a bundle of tools aimed at making these tasks easier
Architecture of the SEMANA platform
A software bundle written in Transcript®, the programming language of Revolution®
Standalone applications for Macintosh and Windows
Dynamic DB Builder: data sheets, data coding, data storage
Formal Concept Analysis (Wille, Ganter): Galois lattice, “central concepts”
Statistical tools (Benzecri): Correlation Matrix, Correspondence Factor Analysis, Hierarchical Classifications
Tables (various formats): “multi-valued tables”, “one-valued tables”
Tree Builder Assistant: aid to code structuring
Rough Set Theory (Pawlak): upper approx., lower approx., reducts, core, discriminating power
Decision Logic (Bolc, Cytowski and Stacewicz): minimal rules, attribute strength
Attribute Editor: discretization, logical scaling …
Working with the SEMANA platform
Three illustrations:
“Ten-ta-to”: the proximal deictic adjectives in Polish
The category of Aspect in Polish
Representations of women in Palaeolithic Art
SEMANA is twofold:
1) Tools for Intelligent Database Designing => Dynamic DB Builder
• providing statistical information about the use of AV (attribute-value pairs)
• suggesting iterative restructuring of AV
2) Tools for KDD research: integration of RST, FCA, Statistical Data Analyses
Case 1:
the Proximal Deictic Adjectives in Polish
The proximal deictic adjectives in Polish
In Polish School Grammar, the adjective declension amalgamates three “morphological categories”:
Case = {Nominative, Accusative, Genitive, Dative, Instrumental, Locative}
Number = {singular, plural}
Gender = in Polish linguistics (cf. SALONI, Z. 1976), up to 7 gender classes have been proposed:
Singular:
1. feminine
2. neuter
3. animal masculine (“animal” corresponds to the feature “animate” in descriptions of other European languages)
4. non-animal masculine
Plural:
5. personal masculine (“personal” corresponds to the feature “human”)
6. non-personal masculine
7. “pluralia tantum” (defective nouns with no singular form)
The proximal deictic adjectives in Polish
The root of these adjectives is a single phoneme t-.
13 forms are used: ten, ta, to, tym, tymi, tych, te, te*, temu, tej, tego, ta*, ci
Examples (Nominative case only)
Gender      Singular      Plural        English translation
Masculine   ten dom       te domy       this/these house(s)
            ten pies      te psy        this/these dog(s)
            ten pan       ci panowie    this/these sir(s)
Feminine    ta deska      te deski      this/these board(s)
            ta gęś        te gęsi       this/these goose/geese
            ta pani       te panie      this/these lady/ladies
Neuter      to pióro      te pióra      this/these feather(s)
            to kurczę     te kurczęta   this/these chicken(s)
            to dziecko    te dzieci     this/these child/children
...         ...           ...
The proximal deictic adjectives in Polish
In order to elucidate the problem of Gender in Polish noun morphology,
H. and A. Wlodarczyk have built a database of usages of the proximal
deictic adjectives.
As the 7 “sub-genders” of Polish School Grammars neither correspond to
any known semantic or ontological categories nor to any known
grammatical sub-gender in other languages, they proposed to split the
“sub-genders” of the Gender attribute into three attributes :
gender = {feminine, neuter, masculine}
animacy = {animate, inanimate}
humanity = {human, non_human}
TENTATO: database first version
[Screenshot of a database entry showing the sample, the morpheme, and the attribute-value pairs (features chosen for each entry)]
An AV table is automatically collected
TENTATO: database first version
Objects = 108
Distinct objects = 108
Duplicates = 0
Duplicate ratio = 0
Attributes = 5 (with resp. 6,2,3,2,2 values)
NB: in this calculation, non-used attributes (*) have been replaced by a null value ('nAtt')
==================================================
Theoretical Number of Combinations = 144
Apparent Saturation Index : 75%
==================================================
The following pairs of attributes could be merged:
[hum|ina] Confidence index = 99.9%
[hum|nhu] Confidence index = 99.9%
[ina|nhu] Confidence index = 99.9%
==================================================
STATISTICAL USE OF AV
Attr   Value    occur
Ani    anim     72
Ani    inanim   36
Case   A        18
Case   D        18
Case   G        18
Case   I        18
Case   L        18
Case   N        18
Gnd    fem      36
Gnd    masc     36
Gnd    neu      36
Hum    hum      36
Hum    nhum     72
Nb     plur     54
Nb     sing     54
==================================================
Non-Attested Pairs of Values = 1
ina,hum,2,4
--------------------------------------------------
Assuming that all non-attested pairs are impossible:
Maximum number of combinations = 108
Corrected Saturation Index : 100%
--------------------------------------------------
The program suggests that these attributes could be merged.
The program indicates that the pair {inanimate, human} does not exist (for obvious reasons).
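The figures in this report can be re-derived with a short sketch. It assumes (this is not stated in the slides, only consistent with the reported counts) that the 108 objects are the full cross-product of the five attributes minus the impossible pair inanimate + human. The function names are illustrative and are not part of Semana.

```python
from itertools import product

# Attribute domains of TENTATO version 1 (value labels follow the report).
DOMAINS = {
    "Case": ["N", "A", "G", "D", "I", "L"],
    "Nb": ["sing", "plur"],
    "Gnd": ["fem", "masc", "neu"],
    "Ani": ["anim", "inanim"],
    "Hum": ["hum", "nhum"],
}

def build_objects():
    """Assumed data: every combination except inanimate + human."""
    names = list(DOMAINS)
    rows = [dict(zip(names, combo)) for combo in product(*DOMAINS.values())]
    return [r for r in rows if not (r["Ani"] == "inanim" and r["Hum"] == "hum")]

def saturation(objects):
    """Return (theoretical combinations, distinct objects, saturation index)."""
    theoretical = 1
    for values in DOMAINS.values():
        theoretical *= len(values)
    distinct = {tuple(sorted(o.items())) for o in objects}
    return theoretical, len(distinct), len(distinct) / theoretical

def non_attested_pairs(objects):
    """Pairs of values from two different attributes that never co-occur."""
    names = list(DOMAINS)
    missing = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            seen = {(o[a], o[b]) for o in objects}
            missing += [(va, vb) for va in DOMAINS[a] for vb in DOMAINS[b]
                        if (va, vb) not in seen]
    return missing

objects = build_objects()
theoretical, distinct, index = saturation(objects)
print(theoretical, distinct, f"{index:.0%}")
print(non_attested_pairs(objects))
```

Under that assumption the sketch reproduces the 144 theoretical combinations, the 75% apparent saturation index, and the single non-attested pair.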
TENTATO (Version 1): Formal Concept Analysis
[Figure: TENTATO Version 1, simplified lattice and complete lattice]
Inanimate depends on non-human
Human depends on animate
Test of dependence
Total Dependence
ina => nhu (36/36)
hum => an (36/36)
High probability (>90%):
none
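A dependence test of this kind can be sketched as counting, among the objects that carry a premise value, how many also carry the conclusion value. The sketch assumes the same reconstruction of the 108 objects as a cross-product without inanimate + human; the function name `dependence` is hypothetical and is not Semana's.

```python
from itertools import product

# Assumed data: 6 cases x 2 numbers x 3 genders x 3 attested (Ani, Hum) pairs.
objects = [
    {"Case": c, "Nb": n, "Gnd": g, "Ani": a, "Hum": h}
    for c, n, g, (a, h) in product(
        "NAGDIL",
        ("sing", "plur"),
        ("fem", "masc", "neu"),
        (("anim", "hum"), ("anim", "nhum"), ("inanim", "nhum")),
    )
]

def dependence(rows, premise, conclusion):
    """Return (hits, support) for the implication premise => conclusion."""
    (p_attr, p_val), (c_attr, c_val) = premise, conclusion
    support = [r for r in rows if r[p_attr] == p_val]
    hits = [r for r in support if r[c_attr] == c_val]
    return len(hits), len(support)

print(dependence(objects, ("Ani", "inanim"), ("Hum", "nhum")))  # ina => nhu
print(dependence(objects, ("Hum", "hum"), ("Ani", "anim")))     # hum => an
print(dependence(objects, ("Ani", "anim"), ("Hum", "hum")))     # not total
```

A total dependence is one where hits equal support; the converse implication (animate => human) holds for only half of the animate objects.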
TENTATO: second version
Objects = 108
Distinct objects = 108
Duplicates = 0
Duplicate ratio = 0
Attributes = 4 (with resp. 3,6,3,2 values)
NB: in this calculation, non-used attributes (*) have been replaced by a null value ('nAtt')
==================================================
Theoretical Number of Combinations = 108
Apparent Saturation Index : 100%
==================================================
No attributes could be merged
==================================================
STATISTICAL USE OF AV
Attr   Value          occur
ANY    human          36
ANY    inanimate      36
ANY    nhuman         36
CAS    accusative     18
CAS    dative         18
CAS    genetive       18
CAS    instrumental   18
CAS    locative       18
CAS    nominative     18
GND    feminine       36
GND    masculine      36
GND    neuter         36
NBR    plural         54
NBR    singular       54
==================================================
Non-Attested Pairs of Values = 0
--------------------------------------------------
Assuming that all non-attested pairs are impossible:
Maximum number of combinations = 108
Corrected Saturation Index : 100%
In a second trial, the attributes
ANIMACY ({ANI}=[animate|inanimate]) and HUMANITY ({HUM}=[human|nhuman])
are merged into a three-valued attribute:
{ANY}=[nhuman|inanimate|human]
No further attribute merging is possible; all pairs of values are attested.
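The merge described here can be sketched as a mapping from the two binary attributes onto one three-valued attribute. The function name `merge_animacy` is illustrative only; the value labels are those of the slides.

```python
def merge_animacy(ani, hum):
    """Map an (ANI, HUM) pair onto the single three-valued attribute ANY."""
    if ani == "inanimate":
        if hum == "human":
            # This pair is never attested in the first version of the database.
            raise ValueError("inanimate + human is not attested")
        return "inanimate"
    return "human" if hum == "human" else "nhuman"

attested = [("animate", "human"), ("animate", "nhuman"), ("inanimate", "nhuman")]
any_values = sorted({merge_animacy(a, h) for a, h in attested})

# Sizes of the four attribute domains after the merge: ANY, CAS, GND, NBR.
theoretical = len(any_values) * 6 * 3 * 2
print(any_values, theoretical)
```

After the merge the theoretical number of combinations drops from 144 to 108, which equals the number of distinct objects, hence the 100% saturation index.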
TENTATO: Formal Concept Analysis
[Figure: simplified and complete lattices for TENTATO Version 1 and Version 2]
TENTATO Version 1, test of dependence:
Total Dependence
ina => nhu (36/36)
hum => an (36/36)
High probability (>90%):
none
=> Inanimate depends on non-human; human depends on animate
TENTATO Version 2, test of dependence:
Total Dependence
none
High probability (>90%):
none
=> All the attributes are at the same level: no hierarchy
TENTATO-2: Rough Set Theory and “Minimal Rules”
r1 (9) : CASdat,NBRplu --> tym
r2 (3) : CASins,GNDmas,NBRsin --> tym
r3 (3) : CASins,GNDneu,NBRsin --> tym
r4 (3) : CASloc,GNDmas,NBRsin --> tym
r5 (3) : CASloc,GNDneu,NBRsin --> tym
r6 (9) : CASins,NBRplu --> tymi
r7 (1) : CASacc,ANYhum,GNDmas,NBRplu --> tych
r8 (9) : CASgen,NBRplu --> tych
r9 (9) : CASloc,NBRplu --> tych
r10 (3) : CASacc,GNDneu,NBRsin --> to
r11 (3) : CASnom,GNDneu,NBRsin --> to
r12 (3) : CASacc,ANYina,NBRplu --> te
r13 (3) : CASacc,ANYnhu,NBRplu --> te
r14 (3) : CASacc,GNDfem,NBRplu --> te
r15 (3) : CASacc,GNDneu,NBRplu --> te
r16 (3) : CASnom,ANYina,NBRplu --> te
r17 (3) : CASnom,ANYnhu,NBRplu --> te
r18 (3) : CASnom,GNDfem,NBRplu --> te
r19 (3) : CASnom,GNDneu,NBRplu --> te
r20 (1) : CASacc,ANYina,GNDmas,NBRsin --> ten
r21 (3) : CASnom,GNDmas,NBRsin --> ten
r22 (3) : CASdat,GNDmas,NBRsin --> temu
r23 (3) : CASdat,GNDneu,NBRsin --> temu
r24 (3) : CASdat,GNDfem,NBRsin --> tej
r25 (3) : CASgen,GNDfem,NBRsin --> tej
r26 (3) : CASloc,GNDfem,NBRsin --> tej
r27 (1) : CASacc,ANYhum,GNDmas,NBRsin --> tego
r28 (1) : CASacc,ANYnhu,GNDmas,NBRsin --> tego
r29 (3) : CASgen,GNDmas,NBRsin --> tego
r30 (3) : CASgen,GNDneu,NBRsin --> tego
r31 (3) : CASacc,GNDfem,NBRsin --> te*
r32 (3) : CASnom,GNDfem,NBRsin --> ta
r33 (3) : CASins,GNDfem,NBRsin --> ta*
r34 (1) : CASnom,ANYhum,GNDmas,NBRplu --> ci
A procedure derived from Rough Set Theory allows us to calculate the “minimal rules” (i.e. the values of the attributes which condition the morpheme to be used).
The 108 distinct objects of the DB can be described by only 34 morphological rules. Note that CAS and NBR are required in every rule, GND in 26/34 and ANY in only 9/34.
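The attribute weights quoted here (CAS and NBR in 34/34 rules, GND in 26/34, ANY in 9/34) can be checked by parsing the 34 rules and counting attribute prefixes. The sketch below does not compute minimal rules; it only counts and applies the rules as listed. The helper names (`parse`, `attribute_weights`, `apply_rules`) are illustrative.

```python
from collections import Counter

RULES = """
CASdat,NBRplu --> tym
CASins,GNDmas,NBRsin --> tym
CASins,GNDneu,NBRsin --> tym
CASloc,GNDmas,NBRsin --> tym
CASloc,GNDneu,NBRsin --> tym
CASins,NBRplu --> tymi
CASacc,ANYhum,GNDmas,NBRplu --> tych
CASgen,NBRplu --> tych
CASloc,NBRplu --> tych
CASacc,GNDneu,NBRsin --> to
CASnom,GNDneu,NBRsin --> to
CASacc,ANYina,NBRplu --> te
CASacc,ANYnhu,NBRplu --> te
CASacc,GNDfem,NBRplu --> te
CASacc,GNDneu,NBRplu --> te
CASnom,ANYina,NBRplu --> te
CASnom,ANYnhu,NBRplu --> te
CASnom,GNDfem,NBRplu --> te
CASnom,GNDneu,NBRplu --> te
CASacc,ANYina,GNDmas,NBRsin --> ten
CASnom,GNDmas,NBRsin --> ten
CASdat,GNDmas,NBRsin --> temu
CASdat,GNDneu,NBRsin --> temu
CASdat,GNDfem,NBRsin --> tej
CASgen,GNDfem,NBRsin --> tej
CASloc,GNDfem,NBRsin --> tej
CASacc,ANYhum,GNDmas,NBRsin --> tego
CASacc,ANYnhu,GNDmas,NBRsin --> tego
CASgen,GNDmas,NBRsin --> tego
CASgen,GNDneu,NBRsin --> tego
CASacc,GNDfem,NBRsin --> te*
CASnom,GNDfem,NBRsin --> ta
CASins,GNDfem,NBRsin --> ta*
CASnom,ANYhum,GNDmas,NBRplu --> ci
"""

def parse(text):
    """Turn 'cond1,cond2 --> morpheme' lines into (conditions, morpheme) pairs."""
    rules = []
    for line in text.strip().splitlines():
        conditions, morpheme = line.split(" --> ")
        rules.append((conditions.split(","), morpheme))
    return rules

def attribute_weights(rules):
    """Count in how many rules each attribute (three-letter prefix) appears."""
    counts = Counter()
    for conditions, _ in rules:
        counts.update({c[:3] for c in conditions})
    return counts

def apply_rules(rules, features):
    """Return the morphemes of all rules whose conditions hold for `features`."""
    return sorted({m for conds, m in rules if set(conds) <= set(features)})

rules = parse(RULES)
weights = attribute_weights(rules)
print(len(rules), dict(weights))
print(apply_rules(rules, {"CASnom", "ANYhum", "GNDmas", "NBRplu"}))
```

Applied to a nominative plural human masculine, only rule r34 fires and selects the morpheme ci.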
TENTATO-2: Statistical analysis
The multi-valued table is unfolded into a one-valued table...
…and the one-valued table is transformed into a Burt table…
A Burt table is a square symmetric table giving the number of co-occurrences of each pair of attribute values.
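The two transformations can be sketched as follows: each multi-valued row is unfolded into a 0/1 row with one column per attribute value, and the Burt table is the product of the transposed one-valued table with itself. The sketch covers only the four descriptive attributes (not the morpheme columns) and assumes that the 108 objects are the full cross-product of their values; the function names are illustrative.

```python
from itertools import product

ATTRS = {
    "CAS": ["acc", "dat", "gen", "ins", "loc", "nom"],
    "ANY": ["hum", "ina", "nhu"],
    "GND": ["fem", "mas", "neu"],
    "NBR": ["plu", "sin"],
}
# One column per attribute value (labels are unique across attributes).
COLUMNS = [v for values in ATTRS.values() for v in values]

def unfold(row):
    """One-valued (0/1) coding of a multi-valued row."""
    chosen = set(row.values())
    return [1 if c in chosen else 0 for c in COLUMNS]

def burt(binary_rows):
    """Burt table B = Z^T Z: co-occurrence counts for every pair of columns."""
    n = len(COLUMNS)
    table = [[0] * n for _ in range(n)]
    for z in binary_rows:
        on = [i for i, bit in enumerate(z) if bit]
        for i in on:
            for j in on:
                table[i][j] += 1
    return table

multi_valued = [dict(zip(ATTRS, combo)) for combo in product(*ATTRS.values())]
one_valued = [unfold(r) for r in multi_valued]
B = burt(one_valued)

idx = COLUMNS.index
print(B[idx("acc")][idx("acc")], B[idx("acc")][idx("hum")], B[idx("plu")][idx("sin")])
```

The diagonal holds the frequency of each value, and two values of the same attribute never co-occur, which gives the zero blocks visible in the Burt table of the slides.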
TENTATO-2: Correspondence Factor Analysis (CFA)
BURT TABLE
      acc  dat  gen  ins  loc  nom  hum  ina  nhu  fem  mas  neu  plu  sin   ci   ta  ta*   te  te* tego  tej  tem  ten   to tych  tym tymi
acc    18    0    0    0    0    0    6    6    6    6    6    6    9    9    0    0    0    8    3    2    0    0    1    3    1    0    0
dat     0   18    0    0    0    0    6    6    6    6    6    6    9    9    0    0    0    0    0    0    3    6    0    0    0    9    0
gen     0    0   18    0    0    0    6    6    6    6    6    6    9    9    0    0    0    0    0    6    3    0    0    0    9    0    0
ins     0    0    0   18    0    0    6    6    6    6    6    6    9    9    0    0    3    0    0    0    0    0    0    0    0    6    9
loc     0    0    0    0   18    0    6    6    6    6    6    6    9    9    0    0    0    0    0    0    3    0    0    0    9    6    0
nom     0    0    0    0    0   18    6    6    6    6    6    6    9    9    1    3    0    8    0    0    0    0    3    3    0    0    0
hum     6    6    6    6    6    6   36    0    0   12   12   12   18   18    1    1    1    4    1    3    3    2    1    2    7    0    0
ina     6    6    6    6    6    6    0   36    0   12   12   12   18   18    0    1    1    6    1    2    3    2    2    2    6    7    3
nhu     6    6    6    6    6    6    0    0   36   12   12   12   18   18    0    1    1    6    1    3    3    2    1    2    6    7    3
fem     6    6    6    6    6    6   12   12   12   36    0    0   18   18    0    3    3    6    3    0    9    0    0    0    6    3    3
mas     6    6    6    6    6    6   12   12   12    0   36    0   18   18    1    0    0    4    0    5    0    3    4    0    7    9    3
neu     6    6    6    6    6    6   12   12   12    0    0   36   18   18    0    0    0    6    0    3    0    3    0    6    6    9    3
plu     9    9    9    9    9    9   18   18   18   18   18   18   54    0    1    0    0   16    0    0    0    0    0    0   19    9    9
sin     9    9    9    9    9    9   18   18   18   18   18   18    0   54    0    3    3    0    3    8    9    6    4    6    0   12    0
ci      0    0    0    0    0    1    1    0    0    0    1    0    1    0    1    0    0    0    0    0    0    0    0    0    0    0    0
ta      0    0    0    0    0    3    1    1    1    3    0    0    0    3    0    3    0    0    0    0    0    0    0    0    0    0    0
ta*     0    0    0    3    0    0    1    1    1    3    0    0    0    3    0    0    3    0    0    0    0    0    0    0    0    0    0
te      8    0    0    0    0    8    4    6    6    6    4    6   16    0    0    0    0   16    0    0    0    0    0    0    0    0    0
te*     3    0    0    0    0    0    1    1    1    3    0    0    0    3    0    0    0    0    3    0    0    0    0    0    0    0    0
tego    2    0    6    0    0    0    3    2    3    0    5    3    0    8    0    0    0    0    0    8    0    0    0    0    0    0    0
tej     0    3    3    0    3    0    3    3    3    9    0    0    0    9    0    0    0    0    0    0    9    0    0    0    0    0    0
tem     0    6    0    0    0    0    2    2    2    0    3    3    0    6    0    0    0    0    0    0    0    6    0    0    0    0    0
ten     1    0    0    0    0    3    1    2    1    0    4    0    0    4    0    0    0    0    0    0    0    0    4    0    0    0    0
to      3    0    0    0    0    3    2    2    2    0    0    6    0    6    0    0    0    0    0    0    0    0    0    6    0    0    0
tych    1    0    9    0    9    0    7    6    6    6    7    6   19    0    0    0    0    0    0    0    0    0    0    0   19    0    0
tym     0    9    0    6    6    0    7    7    7    3    9    9    9   12    0    0    0    0    0    0    0    0    0    0    0   21    0
tymi    0    0    0    9    0    0    3    3    3    3    3    3    9    0    0    0    0    0    0    0    0    0    0    0    0    0    9
FJ     90   90   90   90   90   90  180  180  180  180  180  180  270  270    5   15   15   80   15   40   45   30   20   30   95  105   45
Numbers in the table are considered as coordinates of points in an N-dimensional space.
[Figure: cloud of points in (x, y, z) space with its axes of inertia F1, F2, F3]
CFA calculates the axes of inertia of the cloud of points (F1, F2, F3 …) and displays projections in planes [F1,F2], [F1,F3], etc.
CFA is implemented in “Semana”.
TENTATO-2: Correspondence Factor Analysis (CFA)
CLOUD J
J     FREQ  QLT  INR |   F#1  COR  CTR |   F#2  COR  CTR |   F#3  COR  CTR |   F#4  COR  CTR |
——————————————————————————————————————————————————————————————————————————————————————————
acc     33  397   40 |   -70    3    1 |  -745  391  120 |    47    2    1 |   -48    2    1 |
dat     33  588   42 |   643  271   88 |   483  153   50 |    97    6    2 |   489  157   66 |
gen     33  596   40 |   -15    0    0 |   153   16    5 |  -870  522  186 |  -290   58   23 |
ins     33  869   48 |  -362   77   28 |   682  271  100 |   924  499  210 |  -195   22   11 |
loc     33  326   35 |  -117   11    3 |   366  106   29 |  -504  201   62 |  -103    8    3 |
nom     33  633   44 |   -78    4    1 |  -938  556  190 |   306   59   23 |   147   14    6 |
hum     67    5   23 |     0    0    0 |    29    2    0 |   -34    3    1 |     5    0    0 |
ina     67    5   23 |    -1    0    0 |   -26    2    0 |    34    3    1 |     6    0    0 |
nhu     67    0   22 |     1    0    0 |    -3    0    0 |     0    0    0 |   -11    0    0 |
fem     67  768   33 |    20    1    0 |   -26    1    0 |    85   12    4 |  -669  754  247 |
mas     67  245   28 |    -9    0    0 |    56    6    1 |   -97   19    5 |   332  220   61 |
neu     67  232   28 |   -11    0    0 |   -29    2    0 |    12    0    0 |   337  229   63 |
plu    100  873   30 |  -546  823  189 |    43    5    1 |   -68   13    3 |   108   32   10 |
sin    100  873   30 |   546  823  189 |   -43    5    1 |    68   13    3 |  -108   32   10 |
ci       2   76   36 |  -644   18    5 |  -841   30    8 |   128    1    0 |   804   28   10 |
ta       6  276   40 |   497   29    9 | -1046  127   39 |   545   35   12 |  -856   85   34 |
ta*      6  445   40 |   207    5    2 |   635   47   15 |  1278  190   67 | -1321  203   80 |
te      30  651   44 |  -630  225   75 |  -839  399  135 |   148   12    5 |   156   14    6 |
te*      6  265   40 |   505   30    9 |  -845   83   26 |   237    7    2 | -1121  146   58 |
tego    15  249   42 |   516   79   25 |   -91    2    1 |  -751  167   62 |    -6    0    0 |
tej     17  588   42 |   749  187   60 |   274   25    8 |  -324   35   13 | -1012  341  142 |
temu    11  559   44 |  1200  306  102 |   469   47   16 |   145    4    2 |   973  201   87 |
ten      7  208   39 |   469   34   10 |  -917  132   40 |   262   11    4 |   440   30   12 |
to      11  308   41 |   468   50   15 |  -948  204   65 |   305   21    8 |   378   32   13 |
tych    35  812   44 |  -624  263   87 |   264   47   16 |  -858  498  191 |   -86    5    2 |
tym     39  481   35 |   214   43   11 |   527  256   70 |   174   28    9 |   408  154   54 |
tymi    17  726   47 |  -924  251   91 |   752  166   61 |  1016  304  127 |  -119    4    2 |
Note that NUMBER (singular/plural) has the highest contribution to axis 1.
Note that the quality of the description of the attribute “animacy” is very poor: these elements make no contribution to the first 4 factors.
Reading the columns of the table for factor 1:
F#1 = coordinate of object J on factor 1
COR = contribution of factor 1 to the description of object J
CTR = contribution of object J to the definition of factor 1
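As a worked check of the CTR column, the sketch below assumes the usual correspondence-analysis convention, CTR = mass × coordinate² / eigenvalue, with masses taken from the FJ row of the Burt table and all printed values in thousandths. The slides do not state this formula, so it is an assumption. The eigenvalue of axis 1 is not printed, so it is recovered from one column (plu) and then used to predict the CTR of two others.

```python
# Column totals (row FJ of the Burt table) for three columns.
FJ = {"plu": 270, "dat": 90, "tymi": 45}
GRAND_TOTAL = 2700  # sum of the whole FJ row

# Coordinates on factor 1 as printed in the table (in thousandths).
F1 = {"plu": -546, "dat": 643, "tymi": -924}

def mass(j):
    """Relative weight of column j (the FREQ column, before rounding)."""
    return FJ[j] / GRAND_TOTAL

def eigenvalue_from(j, ctr_per_mille):
    """Invert CTR = mass * F^2 / lambda for a column whose CTR is known."""
    return mass(j) * (F1[j] / 1000) ** 2 / (ctr_per_mille / 1000)

def ctr(j, eigenvalue):
    """Contribution (per mille) of column j to the axis with this eigenvalue."""
    return 1000 * mass(j) * (F1[j] / 1000) ** 2 / eigenvalue

lambda1 = eigenvalue_from("plu", 189)  # printed CTR of plu on factor 1
for j in ("dat", "tymi"):
    print(j, round(ctr(j, lambda1)))
```

The predicted contributions fall within a unit or two of the printed values (88 for dat, 91 for tymi); the small gaps are compatible with the rounding of the printed coordinates.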
Output by “Stat-3”
TENTATO-2: CFA representation in plane [1,2]
[Figure: output by “Stat-3”, plane of Axis 1 and Axis 2]
Axis 1 separates NUMBER => singular vs plural
Axis 2 separates syntactic relators (CASE) => {nom, acc} vs {gen, loc, dat, ins}
ANIMACY & GENDER are not differentiated on axes 1 and 2
Morphemes are widely spread over plane [1,2]
TENTATO-2: Axis 1 separates quantifiers
[Figure: output by “Stat-3”, plane of Axis 1 and Axis 2, with plural and singular on opposite sides of Axis 1]
Morphemes strictly associated with singular:
=> ta, to, ten, te*, tego, tej, temu, ta*
Morphemes strictly associated with plural:
=> ci, te, tych, tymi
One exception:
tym may be either singular or plural
TENTATO-2: Axis 2 separates syntactic relators
[Figure: output by “Stat-3”, plane of Axis 1 and Axis 2, with the cases ins, nom, acc, loc, gen, dat]
Morphemes strictly associated with nominative and/or accusative:
=> ta, to, ten, te*, ci, te
Morphemes strictly associated with genitive, locative, dative and/or instrumental:
=> tej, tych, temu, tym, tymi, ta*
One exception:
tego may be either accusative or genitive
TENTATO-2: Axis 3 separates {gen, loc} vs {inst}
[Figure: output by “Stat-3”, plane of Axis 1 and Axis 3, with the cases ins, acc, loc, gen, dat, nom]
Morphemes tymi, ta* strictly associated with instrumental
Morphemes tych, tego, tej strictly associated with genitive or locative
One exception: tym may be either instrumental or locative
TENTATO-2: Axis 4 separates gender {fem} vs {mas, neu}
[Figure: output by “Stat-3”, plane of Axis 1 and Axis 4, with fem, mas, neu]
Morphemes ta*, te*, tej, ta strictly associated with feminine
Morphemes tego, to, ten, temu, ci strictly associated with masculine or neuter
One exception: tym may be associated with any gender
Note that the attribute [ANIMACY] = {human, nhuman, inanimate} is still not differentiated on axis 4.
TENTATO-2: Animacy appears only on axis 9!
[Figure: output by “Stat-3”, plane of Axis 1 and Axis 9, with hum, nhu, ina]
Morpheme ci strictly associated with human
Axis (% inertia)
Axis 1 (13.05%)
Axis 2 (12.81%)
Axis 3 (11.27%)
Axis 4 (10.0%)
…
Axis 9 (4.35%)
TENTATO-2: CFA and “Minimal Rules” (RST)
Attribute   Weight in minimal rules   Oppositions revealed by CFA
NUMBER      34/34 rules               singular vs plural
CASE        34/34 rules               {nom, acc} vs {gen, loc, dat, inst}; {gen, loc} (dat) vs {inst}
GENDER      26/34 rules               feminine vs masculine
ANIMACY     9/34 rules                human vs {nhum, ina}
The relative strength of the attributes is revealed both by their contribution to the axes of inertia in Factor Analysis and by their weight in Minimal Rules.
Case 2:
the category of Aspect in Polish
A Database built with “Dynamic DB-Builder”
A classical data sheet to fill in for each specimen… Attributes and values are chosen from a list…
… and the resulting AVs appear in a field.
The grammatical form of each specimen is used as index.
A test of consistency
Each specimen is characterized by a set of AV and by its grammatical form (used as index). It may be written as a rule:
if {given set of AV} then index
This allows index inconsistencies to be detected (a test of consistency is provided in Semana).
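A consistency test of this kind can be sketched by grouping the specimens by their set of AV and reporting every set that leads to more than one index. The function name and the toy specimens are hypothetical (the forms `form_A`, `form_B`, `form_C` are placeholders), and this is not Semana's implementation.

```python
from collections import defaultdict

def find_inconsistencies(specimens):
    """Return the AV sets that are associated with more than one index."""
    by_av = defaultdict(set)
    for av_set, index in specimens:
        by_av[frozenset(av_set)].add(index)
    return {av: sorted(forms) for av, forms in by_av.items() if len(forms) > 1}

# Hypothetical toy data: (set of AV, grammatical form used as index).
specimens = [
    ({"VAL=perfective", "TYP=event"}, "form_A"),
    ({"VAL=perfective", "TYP=event"}, "form_B"),
    ({"VAL=imperfective", "TYP=state"}, "form_C"),
]

conflicts = find_inconsistencies(specimens)
for av_set, forms in conflicts.items():
    print(sorted(av_set), "->", forms)
```

Each reported AV set is a rule "if {AV} then index" with several competing conclusions, which signals that the attributes do not yet discriminate between those forms.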
9 different forms applying to exactly the same situation?
This is a warning to the expert: the AV probably do not properly describe the different aspectual situations!
Polish Aspect using Dynamic DB Builder
All specimens are automatically collected in a contingency table… and statistics are reported.
In this initial version, there were more than 2 million theoretical combinations, and 9 pairs of attributes could be merged!
Polish Aspect using Dynamic DB Builder
DB version     Distinct objects   Number of attributes   Number of theor. combin.   Number of “merging attributes”
HW-Aspect-V1   61                 12                     2,064,384                  9
HW-Aspect-V2   60                 11                     1,032,192                  9
HW-Aspect-V3   77                 11                     829,000                    6
HW-Aspect-V4   79                 9                      408,240                    1
HW-Aspect-V5   79                 8                      136,080                    1
HW-Aspect-V6   69                 8                      45,360                     1
HW-Aspect-V7   74                 8                      61,440                     0
HW-Aspect-V8   78                 7                      58,320                     0
Improvements by “trial and error”
From Dynamic DB Builder to STAT-3
The multi-valued table is transformed into a one-valued table for STAT analyses
Polish Aspect: Correspondence Factor Analysis
Factor Analysis of the contingency table shows a clear Guttman effect (i.e. a sequential order of the attributes)
[Figure: projection in the plane of axis 1 and axis 2]
Ascending Hierarchical Classification shows two well-defined classes
A clear partition into two classes according to the attribute [VAL] = {perfective | imperfective}
The Guttman effect shows that attributes are sequentially ordered:
attribute MCMP (morph. comp.): pip > ip > pp > pi > ii
attribute MOD: parallel > sequential > trans > resume > stop > interrupt > keep > OffAndOn
Polish Aspect: Correspondence Factor Analysis
VAL: perfective … imperfective
Attribute   Value (% association with imperfective)
MCMP        pip 0, ip 0, pp 0, pi 100, ii 100
CRE         defnb 0, nRe 30, ndefnb 89
MOD         par 0, seq 0, trans 0, resume 0, stop 35, inter 0, keep 60, OaO 100
ANA         after 0, finish 0, enter 0, start 0, end 33, before 44, nan 69, begin 40, run 84
ITS         decr 0, incr 0, strong 28, weak 54
TYP         ordPr 29, event 17, state 75, refPr 67
Distribution of features along the perfective-to-imperfective path (% association with imperfective)
All the features at 0% necessarily require the perfective.
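A percentage of association of this kind can be sketched as the share of specimens carrying a given feature that are imperfective. The specimens below are hypothetical toy data (only the attribute and value names come from the slides), and the function name is illustrative.

```python
from collections import Counter

def imperfective_share(specimens):
    """Percentage of the specimens carrying each feature that are imperfective."""
    total, imperfective = Counter(), Counter()
    for features, val in specimens:
        for feature in features:
            total[feature] += 1
            if val == "imperfective":
                imperfective[feature] += 1
    return {f: round(100 * imperfective[f] / total[f]) for f in total}

# Hypothetical toy data: (set of features, value of the attribute VAL).
specimens = [
    ({"MOD=keep", "TYP=state"}, "imperfective"),
    ({"MOD=keep", "TYP=event"}, "perfective"),
    ({"MOD=seq", "TYP=event"}, "perfective"),
    ({"MOD=stop", "TYP=state"}, "imperfective"),
]

shares = imperfective_share(specimens)
for feature, pct in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{feature}: {pct}%")
```

Sorting the features by this percentage places them along a perfective-to-imperfective path, with the features at 0% occurring only with the perfective in the data.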
Case 3 :
Images of the Woman in Palaeolithic Art
Images of the Woman in Palaeolithic Art
Customized DB-builder: for each figure, AV are selected with ‘check box’ buttons
Raphaëlle Bourrillon, PhD, Univ. Toulouse-Le Mirail
Images of the Woman in Palaeolithic Art
CFA and HAC show three classes of representations:
Realistic and corpulent
Realistic and slim
Schematic / abstract
Detailed study of the schematic representations of women
CFA and HAC split the schematic / abstract feminine figures into five sub-classes
Formal Concept Analysis
SEMANA: a bundle of tools for KDD research at hand in a single box
with applications in many domains (within and outside Linguistics!)
FROM PREPROCESSING …
Building / editing DB
- Structuring of AV
- Statistics
- AV editing (merging, splitting, etc.)
- Editing/conversion of tables in various formats
… TO MINING
Complementary KDD procedures (RST, FCA ...)
… with special emphasis on the powerful tools of statistical data analysis (CFA, HAC)