楊允言 Iunn Un-gian 2008.7.14 台語文特性分析及其處理技術 Written Taiwanese : Its Characteristic Analysis and Processing Techniques


楊允言 Iunn Un-gian

2008.7.14

台語文特性分析及其處理技術

Written Taiwanese : Its Characteristic Analysis and Processing Techniques


Vita

• 1984-1988 NTU CSIE undergraduate

• 1990/8-1994/1 Sinica IIS assistant

• 1991-1993 NTHU IS graduate

• 1994/2-1996/11 NTU CC programmer

• 1996 – moved to Hualien


Vita-2

• 1999 Dahan I.T. CSIE lecturer

• 2003/8 - assistant prof.

• 2004 - NTU CSIE PhD program

• Journal : IJCLCLP 12(4)

• Project : NSC 3, NMTL 1, Academia Historica 1


Outline

1. Introduction

2. Resources and Survey of Written Taiwanese Processing

3. Coding and I/O of POJ

4. Tone Sandhi Problem and Algorithm


Outline-2

5. Word Segmentation and Tagging Methods

6. Corpora Collection and Annotation

7. Some Applications of Written Taiwanese Corpora

8. Conclusion and Future Work


1. Introduction

1.1 Background

–Population : 46M (2005)

–Distribution : Taiwan, Singapore, Malaysia, Brunei, China, Thailand, Philippines, Indonesia

–Rank : 21

–Confusing names : Southern Min ? Amoy ? Taiwanese ?


1. Introduction-2

1.2 Different Scripts

–Han Characters Script

–Romanization Script (POJ)

–Han-Romanization Mixed Script

–Others : Kana, Phonetic Symbols, Proverb, …


1. Introduction-3

1.3 Phonemes of Taiwanese

–Initials (18)

–Vowels (86)

–Tones (7)

–Compared with Mandarin : 2726 legal syllables vs 1200


1. Introduction-4

1.4 Some Key Points

–Not yet standardized

–The POJ characters are scattered across different blocks of the Unicode character set

–Need to annotate phonetic markers in corpora

–Interact with Taiwanese-language groups


1. Introduction-5

1.5 Motivation

–My mother tongue

1.6 Definition and Glossary

1.7 Goal of This Dissertation

1.8 Organization


2. Resources and Survey

2.1 Resources

–Input method

–Dictionary

–Corpus

–Word segmentation

–Scripts conversion

–Text-to-speech

2.2 Survey


3. Coding and I/O of POJ

3.1 POJ Character Code

–Unicode encoding

3.2 Two Kinds of POJ Representation

–POJ and numbered POJ
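To make the encoding point in 3.1 concrete, the following Python sketch lists the code points behind a few POJ letters. The sample letters and the helper name `describe` are chosen here for illustration and are not taken from the dissertation. It shows that a single POJ text draws on several Unicode blocks (Latin-1 Supplement, Combining Diacritical Marks, Superscripts and Subscripts).

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each character after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in decomposed]


# Illustrative samples: a with circumflex (tone 5), o with dot above right,
# a with vertical line above (tone 8), and the superscript n nasal marker.
samples = ["\u00e2", "o\u0358", "a\u030d", "\u207f"]

for sample in samples:
    print(sample, describe(sample))
```

Letters such as o͘ and a̍ have no precomposed code point, so software has to handle base letter plus combining mark sequences.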


3. Coding and I/O of POJ-2

3.3 Retrieval of POJ

–Issue : both case-sensitive and case-insensitive search are needed

–2-stage retrieval : execute an SQL command, then filter the results

–Fuzzy retrieval : toneless, glottal stop, checked syllable, vowel

–Examples
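One way to read the 2-stage idea is sketched below in Python with the standard `sqlite3` module. It is an assumption-laden illustration, not the system described in the talk: the table name `lexicon`, the sample words, and the helpers `search`, `fuzzy_key` and `fuzzy_search` are hypothetical. Stage 1 relies on SQLite's `LIKE`, which ignores ASCII case by default; stage 2 filters in application code when exact case matters. Only the toneless kind of fuzzy matching is shown.

```python
import re
import sqlite3


def fuzzy_key(word: str) -> str:
    """Toneless, case-folded key for numbered POJ, e.g. 'Tai5-oan5' -> 'tai-oan'."""
    return re.sub(r"[1-9]", "", word).lower()


def search(conn: sqlite3.Connection, query: str, case_sensitive: bool) -> list[str]:
    # Stage 1: SQL query; LIKE is case-insensitive for ASCII letters in SQLite.
    rows = conn.execute(
        "SELECT word FROM lexicon WHERE word LIKE ? ORDER BY rowid",
        (f"%{query}%",),
    ).fetchall()
    words = [row[0] for row in rows]
    # Stage 2: filter in application code when the exact case is required.
    if case_sensitive:
        words = [w for w in words if query in w]
    return words


def fuzzy_search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Toneless retrieval: compare keys with the tone digits removed."""
    key = fuzzy_key(query)
    rows = conn.execute("SELECT word FROM lexicon ORDER BY rowid").fetchall()
    return [row[0] for row in rows if fuzzy_key(row[0]) == key]


# Hypothetical toy data in numbered POJ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (word TEXT)")
conn.executemany(
    "INSERT INTO lexicon VALUES (?)",
    [("Tai5-oan5",), ("tai5-gi2",), ("lang5",)],
)

print(search(conn, "tai5", case_sensitive=False))
print(search(conn, "tai5", case_sensitive=True))
print(fuzzy_search(conn, "tai-oan"))
```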


3. Coding and I/O of POJ-3

3.4 Display of POJ

–Strategy : Unicode (with specific fonts) or images

–POJ to numbered POJ

• lâng → la5ng → lang5

–Numbered POJ to POJ

• lang5 → la5ng → lâng

• Priority : o a e u i n m

• ou..5o5u ou5 ô.
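The two conversions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the tone-to-diacritic table follows common POJ practice (2 acute, 3 grave, 5 circumflex, 7 macron, 8 vertical line, 1 and 4 unmarked), and the carrier letter is found by a plain scan in the slide's priority order. Such a scan will not reproduce every placement convention found in printed POJ.

```python
import unicodedata

# Assumed tone-digit to combining-mark table (tones 1 and 4 are unmarked).
TONE_MARKS = {"2": "\u0301", "3": "\u0300", "5": "\u0302", "7": "\u0304", "8": "\u030d"}
MARK_TONES = {mark: tone for tone, mark in TONE_MARKS.items()}
PRIORITY = "oaeuinm"  # order given on the slide


def numbered_to_poj(syllable: str) -> str:
    """lang5 -> lâng: attach the tone mark to the first letter found in priority order."""
    if not syllable or syllable[-1] not in "123456789":
        return syllable
    base, tone = syllable[:-1], syllable[-1]
    mark = TONE_MARKS.get(tone)
    if mark is None:  # tones without a diacritic
        return base
    lowered = base.lower()
    for letter in PRIORITY:
        pos = lowered.find(letter)
        if pos != -1:
            marked = base[: pos + 1] + mark + base[pos + 1:]
            return unicodedata.normalize("NFC", marked)
    return base


def poj_to_numbered(syllable: str) -> str:
    """lâng -> lang5: strip the tone mark and append the tone digit."""
    tone = None
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in MARK_TONES:
            tone = MARK_TONES[ch]
        else:
            letters.append(ch)
    base = unicodedata.normalize("NFC", "".join(letters))
    if tone is None:
        # Unmarked syllables: tone 4 if checked (ends in p/t/k/h), otherwise tone 1.
        tone = "4" if base[-1:].lower() in ("p", "t", "k", "h") else "1"
    return base + tone


print(numbered_to_poj("lang5"), poj_to_numbered("l\u00e2ng"))
```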


3. Coding and I/O of POJ-4

3.5 Word Processing Utilities for POJ

–Phoneme segmentation : backward direction

–Spelling checker

–Syllable / word / sentence count
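A counting utility of this kind could look like the Python sketch below. It assumes the usual POJ conventions that spaces separate words and hyphens join the syllables of a word, and it treats `.`, `!` and `?` as sentence ends. The function name `poj_counts` is hypothetical.

```python
import re


def poj_counts(text: str) -> dict[str, int]:
    """Count syllables, words and sentences in a POJ text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # A word is a run of characters without whitespace or punctuation.
    words = re.findall(r"[^\s.,;:!?\"()]+", text)
    # Syllables inside a word are joined by one or more hyphens.
    syllables = sum(len([p for p in re.split(r"-+", w) if p]) for w in words)
    return {"syllables": syllables, "words": len(words), "sentences": len(sentences)}


print(poj_counts("Kin-a2-jit8 thinn-khi3 chin ho2."))
```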


4. Tone Sandhi

4.1 Tone Sandhi Problem

–Types of tone sandhi

• Normal sandhi

• Following sandhi

• Neutral sandhi

• Double sandhi

• Pre-á sandhi

• Triplicate sandhi

• Rising sandhi


4. Tone Sandhi-2

4.1 Tone Sandhi Problem

–Most complicated among the Sinitic languages

–Need to find the boundaries of tone sandhi groups
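Why the group boundary matters can be shown with a small Python sketch of normal sandhi only: every syllable in a tone sandhi group except the last changes tone. The mapping below is the commonly described one (1→7, 2→1, 3→2, 5→7, 7→3, checked tones depending on the final consonant); 5→7 is one regional variant, and the other sandhi types listed earlier are not handled. The function names are hypothetical, and input syllables must carry an explicit tone digit.

```python
# Commonly described normal-sandhi mapping for non-checked tones (assumption: 5 -> 7 variant).
NORMAL_SANDHI = {"1": "7", "2": "1", "3": "2", "5": "7", "7": "3"}


def sandhi_tone(syllable: str) -> str:
    """Return the sandhi form of one numbered-POJ syllable (tone digit 1-5, 7 or 8)."""
    base, tone = syllable[:-1], syllable[-1]
    if tone in ("4", "8"):
        if base.endswith("h"):
            new_tone = "2" if tone == "4" else "3"
        else:  # final p, t or k
            new_tone = "8" if tone == "4" else "4"
    else:
        new_tone = NORMAL_SANDHI[tone]
    return base + new_tone


def apply_normal_sandhi(group: list[str]) -> list[str]:
    """Change every syllable of a tone sandhi group except the last one."""
    if not group:
        return []
    return [sandhi_tone(s) for s in group[:-1]] + [group[-1]]


print(apply_normal_sandhi(["chin1", "ho2"]))
```

The output depends entirely on where the group ends, which is the boundary problem named above.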


4. Tone Sandhi-3


4. Tone Sandhi-4

4.2 Implementation

–Training and test data : POJ

–Tag set : A(adj) C(conj) D(adv) G(postposition) I(interjection) M(special marker) N(noun) P(prep) R(pron) S(time) T(aux) V(verb)

–Taiwanese-Mandarin dict & Chinese electronic dict


4. Tone Sandhi-5

4.3 Rule-based Algorithm

–20 rules

–Syllable / word / POS / sentence level

4.4 Result

–Training data : 97.39%

–Test data : 88.98%


5. Word Seg and Tagging

5.1 Word Segmentation

–For Han-Romanization mixed script

–Forward maximal matching (FMM) vs backward maximal matching (BMM)

• … 看台語 … : 看台 語 (FMM) or 看 台語 (BMM) ? (看 “look at”, 台語 “Taiwanese”, 看台 “grandstand”, 語 “language”)

–Ambiguity : resolved statistically

• P(看) × P(台語) >> P(看台) × P(語)
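The 看台語 example can be reproduced with a short Python sketch of FMM, BMM and the probability comparison. The word probabilities are invented for illustration (they are not corpus figures), and the function names and the `floor` value for unknown words are hypothetical.

```python
import math


def fmm(text: str, lexicon, max_len: int = 4) -> list[str]:
    """Forward maximal matching: take the longest lexicon word from the left."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon or size == 1:
                words.append(piece)
                i += size
                break
    return words


def bmm(text: str, lexicon, max_len: int = 4) -> list[str]:
    """Backward maximal matching: take the longest lexicon word from the right."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if piece in lexicon or size == 1:
                words.insert(0, piece)
                j -= size
                break
    return words


def pick(text: str, probs: dict[str, float], floor: float = 1e-9) -> list[str]:
    """When FMM and BMM disagree, keep the segmentation with the larger probability product."""
    candidates = [fmm(text, probs), bmm(text, probs)]
    return max(candidates, key=lambda seg: math.prod(probs.get(w, floor) for w in seg))


# Invented unigram probabilities, for illustration only.
probs = {"看": 0.01, "台語": 0.005, "看台": 0.0001, "語": 0.0005}

print(fmm("看台語", probs), bmm("看台語", probs), pick("看台語", probs))
```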


5. Word Seg and Tagging-2

5.2 POS Tagging

–Data : POJ and HR mixed parallel corpus

–Tag set : CKIP Chinese tagset

–Taiwanese-Mandarin dict

–Chinese bigram training data


5. Word Seg and Tagging-3


5. Word Seg and Tagging-4

5.2 POS Tagging

–Example :

• 因為 [in-ūi]{ 由於 ; 因為 }< 因為 >(Cbb)

• 等待 [tán-thāi]{ 留待 ; 等待 }< 等待 >(VK)

• 朋友 [pêng-iú]{ 友人 ; 朋友 }< 朋友 >(Na)

• , [,]<,>(COMMACATEGORY)

• 心適 [sim-sek]{ 好玩 ; 好玩兒 ; 有趣 ; 風趣 ; 愉快 ; 稀奇 ; 鬧著玩 }< 有趣 >(VH)

• 心適 [sim-sek]{ 好玩 ; 好玩兒 ; 有趣 ; 風趣 ; 愉快 ; 稀奇 ; 鬧著玩 }< 有趣 >(VH)
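The example appears to follow the layout word [POJ]{candidate Mandarin translations}<selected translation>(POS). One plausible reading of the pipeline, given the Taiwanese-Mandarin dictionary and the Chinese bigram data mentioned earlier, is that the translation sequence is chosen by bigram score and the POS is then read off the chosen Mandarin word. The Python sketch below illustrates that reading with a small Viterbi search. It is an assumption, not the dissertation's implementation: the bigram values, the `floor` smoothing and the function name are invented, and only the candidate lists and tags are copied from the example.

```python
import math


def best_translations(candidates, bigram, floor=1e-6):
    """Viterbi search for the candidate sequence with the highest bigram score."""
    scores = {word: 0.0 for word in candidates[0]}
    back = []
    for current_list in candidates[1:]:
        new_scores, pointers = {}, {}
        for cur in current_list:
            # Best predecessor for this candidate under the bigram model.
            prev = max(scores, key=lambda p: scores[p] + math.log(bigram.get((p, cur), floor)))
            new_scores[cur] = scores[prev] + math.log(bigram.get((prev, cur), floor))
            pointers[cur] = prev
        back.append(pointers)
        scores = new_scores
    path = [max(scores, key=scores.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Candidate lists and tags copied from the slide; bigram values are invented.
candidates = [["由於", "因為"], ["留待", "等待"], ["友人", "朋友"]]
bigram = {("因為", "等待"): 0.02, ("等待", "朋友"): 0.03}
pos_of = {"因為": "Cbb", "等待": "VK", "朋友": "Na"}

chosen = best_translations(candidates, bigram)
print([(word, pos_of.get(word)) for word in chosen])
```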


5. Word Seg and Tagging-5

5.2 POS Tagging

–Result : 91.49%

–Error analysis :

• Wrong Chinese translation word

• No suitable Chinese translation to select

• Unknown word

• Proper noun

• Propagation error


6. Collect/Annotate Corpora

6.1 Corpora Collection

–POJ (3M+ syllables)

–Han-Romanization Mixed (5M+ syllables)

–Sources :

• Project results

• Articles in magazines

• Academic papers


6. Collect/Annotate Corpora-2

6.2 Raw Corpus Pre-processing

–Space between “-” and character

–“-” between Han character and POJ

–Alignment


6. Collect/Annotate Corpora-3

6.3 Corpus Annotation

–POS

–Semantic annotation

–Phonetic annotation

–Special pattern marker


7. Corpora Applications

7.1 Basic Count

–Syllable / word count

–Zipf's law

–Proportion of POJ in Han-Romanization mixed script

–Orthography suggestions for inconsistently written words
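A Zipf's law check on a token list can be sketched in Python as a least-squares fit of log frequency against log rank; a slope near -1 is what the law predicts. The function name `zipf_slope` and the toy token list are illustrative only and do not come from the corpora described here.

```python
import math
from collections import Counter


def zipf_slope(tokens: list[str]) -> float:
    """Least-squares slope of log(frequency) against log(rank); needs 2+ distinct tokens."""
    freqs = [count for _, count in Counter(tokens).most_common()]
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Toy data built so that frequency is exactly proportional to 1 / rank.
toy_tokens = []
for rank, syllable in enumerate(["e5", "li2", "goa2", "si7", "chin1"], start=1):
    toy_tokens += [syllable] * (120 // rank)

print(zipf_slope(toy_tokens))
```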


7. Corpora Applications-2

7.2 Concordancer system

–For language learning

–For syntax study

7.3 Collocation

–MI & Correlation (χ²)

–VN, NV, AN, NN
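Both collocation measures can be computed from four counts: corpus size N, the counts of the two words, and the count of the pair. The Python sketch below uses the standard pointwise mutual information formula and the 2×2 contingency-table form of χ². The function names and the sample counts are hypothetical.

```python
import math


def mutual_information(n: int, c_x: int, c_y: int, c_xy: int) -> float:
    """Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) )."""
    return math.log2(n * c_xy / (c_x * c_y))


def chi_square(n: int, c_x: int, c_y: int, c_xy: int) -> float:
    """Chi-square statistic of the 2x2 table for the pair (x, y)."""
    o11 = c_xy                    # x followed by y
    o12 = c_x - c_xy              # x without y
    o21 = c_y - c_xy              # y without x
    o22 = n - c_x - c_y + c_xy    # neither
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# Invented counts: 100 bigrams, x seen 10 times, y 20 times, the pair 8 times.
print(mutual_information(100, 10, 20, 8), chi_square(100, 10, 20, 8))
```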


7. Corpora Applications-3

7.4 Lexical Change and Variation

–Two periods : before / after 1945

–Register :

• Japanese loanwords

• Mandarin loanwords

• church register


7. Corpora Applications-4

7.4 Lexical Change and Variation

–Two Taiwanese Bible versions (New Testament) : 1916 and 1972

–Dialect difference

–Common words : 31%

–43% of words disappeared after 5 decades
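Figures of this kind can be derived by comparing the word types of the two versions. The Python sketch below shows one possible definition, which is an assumption: the slides do not say which denominators were used. Here the common share is taken over the union of both vocabularies and the disappeared share over the older vocabulary. The function name and the toy word lists are hypothetical.

```python
def compare_vocabularies(old_words, new_words) -> dict[str, float]:
    """Share of word types common to both texts and share of old types that disappeared."""
    old, new = set(old_words), set(new_words)
    return {
        "common_share": len(old & new) / len(old | new),
        "disappeared_share": len(old - new) / len(old),
    }


# Toy word lists in numbered POJ, for illustration only.
older = ["lang5", "ho2", "chin1", "toa7"]
newer = ["chin1", "toa7", "sin1"]

print(compare_vocabularies(older, newer))
```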


7. Corpora Applications-5

7.5 Language Learning and Test

7.6 Coarticulation


7. Corpora Applications-6

7.7 POJ / HR mixed script conversion

–POJ to HR mixed

• Kin-a2-jit8 thinn-khi3 chin ho2 → 今仔日天氣真好 (“The weather is really nice today”)

• Look up dictionary

• Bigram, unigram (5M syllables of training data)

• (input method)
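The dictionary lookup step for POJ to HR mixed conversion can be sketched in Python as below. This is a simplified illustration that uses only unigram counts to choose among candidates, whereas the slide also mentions bigrams. The dictionary entries, the counts and the function name `poj_to_hr` are invented. Words missing from the dictionary stay in romanization, which is what makes the output a mixed script.

```python
def poj_to_hr(sentence: str, dictionary: dict, unigram: dict) -> list[str]:
    """Convert space-separated numbered-POJ words to Han characters where possible."""
    output = []
    for word in sentence.split():
        candidates = dictionary.get(word.lower())
        if not candidates:
            output.append(word)  # no entry: keep the romanized form
        else:
            # Choose the candidate with the highest (hypothetical) unigram count.
            output.append(max(candidates, key=lambda c: unigram.get(c, 0)))
    return output


# Invented dictionary entries and counts, for illustration only.
dictionary = {
    "kin-a2-jit8": ["今仔日"],
    "thinn-khi3": ["天氣"],
    "chin": ["真", "親"],
    "ho2": ["好"],
}
unigram = {"真": 50, "親": 5}

print("".join(poj_to_hr("Kin-a2-jit8 thinn-khi3 chin ho2", dictionary, unigram)))
```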


7. Corpora Applications-7

7.7 POJ / HR mixed script conversion

–HR mixed to POJ

• 今仔日天氣真好 → Kin-a2-jit8 thinn-khi3 chin ho2

• Word segmentation

• Look up dictionary

• Bigram, unigram (3M syllables / words of training data)


8. Future Work

8.1 Summary

8.2 Future Work

–Parser

–Machine translation

–OCR

–Submit corpora to the LDC


8. Future Work

I hope this dissertation will become a textbook on written Taiwanese processing (written in Taiwanese or Mandarin).


敬請指教 Kèng-chhián chí-kàu

Your comments are welcome.