Segmentation Similarity and Agreement
Transcript of Segmentation Similarity and Agreement
Segmentation Similarity and Agreement
A metric for evaluating automatic and human segmenters
Chris Fournier and Diana Inkpen
School of Electrical Engineering and Computer Science, University of Ottawa
June 4, 2012
What is segmentation? (Introduction)
Figure: Baker (1990, pp. 76–77)
What is segmentation? (Introduction)
Par.    Topic
1–3     Intro - the search for life in space
4–5     The moon's chemical composition
6–8     How early earth-moon proximity shaped the moon
9–12    How the moon helped life evolve on earth
13      Improbability of the earth-moon system
14–16   Binary/trinary star systems make life unlikely
17–18   The low probability of nonbinary/trinary systems
19–20   Properties of earth's sun that facilitate life
21      Summary
Figure: Hyp. segmentation (Hearst 1997, p. 33)
Why do we segment? (Introduction)
To model topical shifts, aiding:
- Video and audio retrieval (Franz et al. 2007)
- Question answering (Oh et al. 2007)
- Subjectivity analysis (Stoyanov & Cardie 2008)
- Automatic summarization (Haghighi & Vanderwende 2009)
Types of segmentation (Introduction)
Linear: a single sequence of segments, e.g. s1 with segment masses 3, 2, 3, 1.

Hierarchical: nested segments, e.g. a root of mass 5 split into segments of mass 3 and 2, each further split into unit-mass segments.
Automatic segmentation (Introduction)
Many automatic segmenters exist:
- TextTiling (Hearst 1997)
- Minimum Cut segmenter (Malioutov & Barzilay 2006)
- Bayesian segmenter (Eisenstein & Barzilay 2008)
- Affinity Propagation for Segmentation (Kazantseva & Szpakowicz 2011)
Problem: selecting a segmenter (Introduction)
How do we select the best-performing segmenter for a task?
- Ideally, evaluate performance in situ:
  - Evaluate end-task performance while varying segmenters
  - Attain ecological validity¹
    - "... the ability of experiments to tell us how real people operate in the real world" (Cohen 1995, p. 102)
- This is time consuming and expensive

¹ For an example study, see McCallum et al. (2012)
Problem: selecting a segmenter (Introduction)
How do we less expensively select the best-performing segmenter for a task?
1. Identify/collect manual segmentations
2. Verify their reliability
3. Train an automatic segmenter
4. Compare automatic and manual segmentations using a metric
Focus (Introduction)
We focus on comparing segmentations to evaluate:
- The reliability of manual segmentations
- The performance of automatic segmenters
Why is this comparison difficult? (Difficulty)
Difficulty arises because:
- There is no one "true" segmentation:
  - Low manual agreement (Hearst 1997)
  - Coders disagree on granularity (Pevzner & Hearst 2002)
  - Few boundaries to agree upon (Hearst 1993, p. 6)
- Near misses often occur between boundaries
No one "true" segmentation (Difficulty)
Figure: 7 manual codings collected by Hearst (1997) of Stargazers Look for Life (Baker 1990)
Near misses (Difficulty)
Figure: Counts of full and near misses versus the distance (in PBs) considered a near miss, for the segmentations of Kazantseva & Szpakowicz (2012)
Existing evaluation metrics (Evaluation Metrics)
Existing segmentation evaluation metrics:
- Precision, Recall, and Fβ-measure
  - Do not discount near misses
- Pk (Beeferman & Berger 1999)
  - Window-based near-miss accounting
  - Not stable (Pevzner & Hearst 2002)
- WindowDiff (Pevzner & Hearst 2002)
  - A substantial modification of Pk
  - More stable (Pevzner & Hearst 2002)
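For concreteness, a minimal sketch of WindowDiff, the window metric that S is later compared against; the 0/1 boundary encoding and the window-size heuristic are common conventions, not taken from these slides:

```python
def windowdiff(ref, hyp, k=None):
    """WindowDiff (Pevzner & Hearst 2002): slide a window of k units over
    both segmentations (encoded as 0/1 boundary sequences) and count the
    windows in which the two boundary counts differ; 0 is a perfect score."""
    assert len(ref) == len(hyp)
    if k is None:
        # a common heuristic: half the mean reference segment size
        k = max(2, round(len(ref) / (sum(ref) + 1) / 2))
    n = len(ref)
    errors = sum(1 for i in range(n - k)
                 if sum(ref[i:i + k]) != sum(hyp[i:i + k]))
    return errors / (n - k)
```

Identical sequences score 0.0, and a boundary shifted by one unit is only partially penalized, since most windows still contain matching boundary counts.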
Stability & internal segment sizes (Evaluation Metrics)
Figure: Metric values (1 − WD and S) for 10 trials of 100 segmentations with FP & FN p = 0.5, across internal segment size pairs (20,30), (15,35), (10,40), and (5,45)
Common failings (Evaluation Metrics)
Existing segmentation evaluation metrics:
- Require one "true" reference
  - Cannot use multiple manual codings
- Cannot be adapted for agreement
  - Pairwise means must be permuted
  - WD(s1, s2) ≠ WD(s2, s1)
A new metric: S (Segmentation Similarity)
Segmentation Similarity (S):
- A new boundary edit distance
  - Edit distance used to penalize error
  - Scales and normalizes penalties in relation to segment mass

S is ideal because it is:
- A minimum edit distance (stable)
- Symmetric (no "true" segmentation)
- Highly configurable
Parameters (Segmentation Similarity)
S has three parameters:
- n: the number of PBs considered a near miss (default is 2)
- TE (y/n): whether to use transposition error scaling (default is yes)
- Weights on error types to reduce their severity (default is 1 PB each)
Mass and potential boundaries (Segmentation Similarity)
Segmentations have:
- Potential boundaries (PBs) separating units
- Mass, measured in units
- Types of boundaries

Figure: Annotation of segmentation mass (a segmentation of 6 units written as segment masses 1, 3, 2)
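The mass notation can be converted to and from boundary positions; a small sketch (the function names are mine, not from the slides):

```python
def masses_to_boundaries(masses):
    """Segment masses -> internal potential-boundary positions that hold a
    boundary, e.g. [1, 3, 2] -> {1, 4} (PBs 1 and 4 of the 5 internal PBs)."""
    positions, total = set(), 0
    for mass in masses[:-1]:   # the end of the final segment is not internal
        total += mass
        positions.add(total)
    return positions

def boundaries_to_masses(positions, total_mass):
    """Inverse: boundary positions plus total mass -> segment masses."""
    edges = [0] + sorted(positions) + [total_mass]
    return [right - left for left, right in zip(edges, edges[1:])]
```

For the figure's segmentation of mass 6 with segment masses 1, 3, 2, boundaries sit at PBs 1 and 4.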
Modelling dissimilarity (Segmentation Similarity)
Linear segmentation errors can be modelled as edit operations at positions:
- Position 1: an n-wise transposition
- Positions 2, 3, 4: substitutions

Figure: Types of segmentation errors between s1 and s2 (FP and FN boundaries at positions 1–4)
Normalization (Segmentation Similarity)
S(s_i1, s_i2) = (mass(i) − 1 − d(s_i1, s_i2)) / (mass(i) − 1)
Calculating similarity (Segmentation Similarity)
From the previous example:
- 4 edits (3 substitutions and 1 transposition)
- 14 units of mass

S(s_i1, s_i2) = (14 − 1 − 4) / (14 − 1) = 9/13 = 0.6923

For comparison, 1 − WD = 0.6154
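The calculation can be sketched end to end. The greedy near-miss pairing below is a simplification of the full boundary edit distance, and the helper names are mine:

```python
def masses_to_boundaries(masses):
    """Segment masses -> internal boundary positions, e.g. [1, 3, 2] -> {1, 4}."""
    positions, total = set(), 0
    for mass in masses[:-1]:
        total += mass
        positions.add(total)
    return positions

def boundary_edit_distance(b1, b2, n=2):
    """Greedy stand-in for the boundary edit distance: unmatched boundaries
    lying within n - 1 PBs of each other pair up as one transposition each
    (near misses); the remainder count as substitutions (full misses)."""
    only1, only2 = sorted(b1 - b2), sorted(b2 - b1)
    distance = 0
    for p in list(only1):
        near = next((q for q in only2 if abs(q - p) < n), None)
        if near is not None:
            only1.remove(p)
            only2.remove(near)
            distance += 1              # one transposition
    return distance + len(only1) + len(only2)

def s(masses1, masses2, n=2):
    """S(s_i1, s_i2) = (mass(i) - 1 - d) / (mass(i) - 1)."""
    mass = sum(masses1)
    assert mass == sum(masses2), "both segmentations must share one mass"
    d = boundary_edit_distance(masses_to_boundaries(masses1),
                               masses_to_boundaries(masses2), n)
    return (mass - 1 - d) / (mass - 1)
```

With 14 units of mass and 4 edits, as above, the normalization yields (14 − 1 − 4) / (14 − 1) = 9/13 ≈ 0.6923.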
Near misses (Segmentation Similarity)
S can scale near misses by the PBs they span:

te(n, b) = b − (1/b)^(n−2), where n ≥ 2 and b > 0

Example: s1 with segment masses 6 and 8 vs. s2 with masses 7 and 7:
S = 0.9231
1 − WD = 0.8182
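A sketch of the transposition scaling, under the assumption that b is the number of boundaries the transposition moves (2 in the simple two-boundary case):

```python
def te(n, b=2):
    """Transposition-error scaling te(n, b) = b - (1/b)**(n - 2): an
    adjacent swap (n = 2) costs b - 1, and the penalty approaches b
    (two full substitutions when b = 2) as the span n grows."""
    assert n >= 2 and b > 0
    return b - (1 / b) ** (n - 2)
```

So te(2) = 1.0 and te(3) = 1.5; for the single adjacent near miss in the example above, d = te(2) = 1, giving S = (14 − 1 − 1) / (14 − 1) = 12/13 ≈ 0.9231.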
Increasing near miss span size (Segmentation Similarity)
Figure: Metric value versus the difference in boundary position (units), comparing 1 − WD, S(n = 3), S(n = 5, scaled), and S(n = 5, w_trp = 0)
Reliability of manual codings (Segmentation Agreement)
How do we verify manual reliability?
- Inter-coder agreement coefficients:²³ κ, π, κ*, and π*, each of the form (Aa − Ae) / (1 − Ae)
- Adapt them to use Segmentation Similarity: κ_S, π_S, κ*_S, and π*_S

² Fleiss's multi-π (π*) is Siegel & Castellan's (1988) κ
³ Formulations from Artstein & Poesio (2008) are used
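As a sketch, the adaptation plugs mean pairwise S in as the observed agreement Aa; the chance-agreement term Ae below is a deliberate simplification (a single boundary-presence category with a known placement probability), not the exact formulation from Artstein & Poesio (2008):

```python
from itertools import combinations
from statistics import mean

def pi_s(codings, s, boundary_prob):
    """pi-style coefficient (Aa - Ae) / (1 - Ae) with S as observed
    agreement.  `codings` is a list of segmentations, `s` any pairwise
    similarity function, and `boundary_prob` the chance that a coder
    places a boundary at a given PB (a simplifying assumption)."""
    aa = mean(s(c1, c2) for c1, c2 in combinations(codings, 2))
    ae = boundary_prob ** 2   # two coders agreeing on a boundary by chance
    return (aa - ae) / (1 - ae)
```

With a stub similarity of 0.9 and a boundary probability of 0.2, the coefficient is (0.9 − 0.04) / (1 − 0.04) ≈ 0.896.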
Categories (Segmentation Agreement)
Calculate Ae using one category per t:
- Boundary presence (K = {seg_t | t ∈ T})

Why?
- Coders either place a boundary or not
- Coders do not place non-boundaries
- We desire boundary agreement
  - "Unsure" and "no choice" are not options
  - The default is no boundary placement
Examples of manual codings (Multiply-Coded Corpora)
Linear multiply-coded segmentations:
- Kazantseva & Szpakowicz (2012)
  - The Moonstone by Wilkie Collins
  - Topically segmented by 4–6 coders at the paragraph level
- Hearst (1997)
  - Stargazers Look for Life by Dan Baker
  - Topically segmented by 7 coders at the paragraph level
Overall agreement (Multiply-Coded Corpora)
Kazantseva & Szpakowicz (2012):
  Mean coder group π*_S: 0.8923 ± 0.0377
  Mean S: 0.8885 ± 0.0662

Hearst (1997):
  π*_S: 0.7514
  Mean S: 0.7619 ± 0.0706
Overall error types (Multiply-Coded Corpora)
Misses:
                                  Full   Near
Kazantseva & Szpakowicz (2012)    1039   212
Hearst (1997)                     72     28

Figure: Proportions of substitutions, transpositions, and PBs without error for Kazantseva & Szpakowicz (2012) and Hearst (1997)
Comparing segmenters (Evaluation)
How can we compare automatic segmenters?
- Pairwise mean S with manual codings, e.g. mean(S1, S2, S3) for one automatic segmentation compared against three manual codings
- Statistical hypothesis testing
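The pairwise-mean comparison can be sketched as follows, where `s` stands for any pairwise implementation of the metric:

```python
from statistics import mean, stdev

def evaluate_segmenter(auto_seg, manual_codings, s):
    """Score one automatic segmentation against each manual coding and
    report the mean and standard deviation; the per-coder scores can also
    feed a statistical hypothesis test between competing segmenters."""
    scores = [s(auto_seg, coding) for coding in manual_codings]
    return mean(scores), stdev(scores)
```

Ranking segmenters by mean alone can mislead when the spread is large, which is why the per-coder scores are kept for hypothesis testing.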
Comparing segmenters (Evaluation)
How can we compare automatic segmenters?
- Differences in agreement:
  1. Calculate manual coder agreement: π*_{S,3M}
  2. Recalculate agreement, adding an automatic segmenter's values: π*_{S,3M,1A}
  3. Compare the two agreement values
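The three steps above can be sketched with a hypothetical `pi_star` function over a list of codings:

```python
def agreement_delta(manual_codings, auto_coding, pi_star):
    """Change in multi-coder agreement when an automatic segmenter joins
    the pool of manual coders: pi*_{S,3M,1A} - pi*_{S,3M}.  A value near
    zero (or above) suggests the segmenter behaves like a human coder;
    a large drop suggests it does not."""
    before = pi_star(manual_codings)
    after = pi_star(manual_codings + [auto_coding])
    return after - before
```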
Summary (Conclusion)
Segmentation Similarity (S):
- Stable, unlike window metrics
- Highly configurable
- Gives detailed error information
- Mean values can be used to perform statistical hypothesis tests

Adapted inter-annotator agreement coefficients:
- Quantify manual agreement and reliability
- Compare automatic segmenters in terms of human performance
Future work & Implementation (Conclusion)
Future work:
- Multiple boundary types
- Hierarchical segmentation

Software implementation: http://nlp.chrisfournier.ca/
References
Artstein, R. & Poesio, M. (2008), 'Inter-coder agreement for computational linguistics', Computational Linguistics 34(4), 555–596.

Baker, D. (1990), 'Stargazers look for life', South Magazine 117, 76–77.

Beeferman, D. & Berger, A. (1999), 'Statistical models for text segmentation', Machine Learning 34(1–3), 177–210.

Cohen, P. R. (1995), Empirical Methods for Artificial Intelligence, MIT Press, Cambridge, MA, USA.

Eisenstein, J. & Barzilay, R. (2008), Bayesian unsupervised topic segmentation, in 'Proceedings of the Conference on Empirical Methods in Natural Language Processing', Association for Computational Linguistics, Morristown, NJ, USA, pp. 334–343.
Franz, M., McCarley, J. S. & Xu, J.-M. (2007), A user-oriented text segmentation evaluation measure, in 'Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval', pp. 701–702.

Haghighi, A. & Vanderwende, L. (2009), Exploring content models for multi-document summarization, in 'Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics', NAACL '09, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 362–370.

Hearst, M. A. (1993), TextTiling: A Quantitative Approach to Discourse, Technical report.

Hearst, M. A. (1997), 'TextTiling: segmenting text into multi-paragraph subtopic passages', Computational Linguistics 23(1), 33–64.
Kazantseva, A. & Szpakowicz, S. (2011), Linear text segmentation using affinity propagation, in 'Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing', Association for Computational Linguistics, Edinburgh, Scotland, UK, pp. 284–293.

Kazantseva, A. & Szpakowicz, S. (2012), Topical segmentation: a study of human performance, in 'Proceedings of Human Language Technologies: The 2012 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT '12)', Association for Computational Linguistics.

Malioutov, I. & Barzilay, R. (2006), Minimum cut model for spoken lecture segmentation, in 'Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics', ACL-44, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 25–32.
McCallum, A., Munteanu, C., Penn, G. & Zhu, X. (2012), Ecological validity and the evaluation of speech summarization quality, in 'Proceedings of the NAACL-HLT 2012 Workshop on Evaluation Metrics and System Comparison for Automatic Summarization', Association for Computational Linguistics.

Oh, H.-J., Myaeng, S. H. & Jang, M.-G. (2007), 'Semantic passage segmentation based on sentence topics for question answering', Information Sciences 177(18), 3696–3717.

Pevzner, L. & Hearst, M. (2002), 'A critique and improvement of an evaluation metric for text segmentation', Computational Linguistics 28(1), 19–36.

Siegel, S. & Castellan, N. (1988), Nonparametric Statistics for the Behavioral Sciences, second edn, McGraw-Hill, Inc.
Stoyanov, V. & Cardie, C. (2008), Topic identification for fine-grained opinion analysis, in 'Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1', COLING '08, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 817–824.