IRF1 What’s different with Chinese in cross-language IR? Jian-Yun Nie University of Montreal,...

IRF 1

What’s different with Chinese in cross-language IR?

Jian-Yun Nie, University of Montreal, Canada

IRF 2

Outline

1. General characteristics of Chinese
2. Monolingual IR in Chinese
3. CLIR with Chinese
4. OOV: an important problem in Chinese IR
   Solutions?

IRF 3

1. General characteristics of Chinese

Sentence = ideograms with no separation
它是一种适于在拖拉机使用的转向球接头,… (It is a steering ball joint suitable for use on tractors, …)

Words?
它 / 是 / 一种 / 适于 / 在 / 拖拉机 / 使用 / 的 / 转向 / 球 / 接头 / ,…

IRF 4

Word formation

Each character can be a word (人: person)
Most words are composed of two or more characters (人群: mass)
However:
  No clear definition of the notion of word
    办公楼 (office building): / 办公楼 / or / 办公 / 楼 /?
  Inconsistency in manual segmentation
  Many new words are created (abbreviations)
    E.g. 网络 (network) + 管理员 (administrator) → 网管 (webmaster)

IRF 5

2. IR using word segmentation

Segmentation uses rules, dictionaries and/or statistics

Problems for information retrieval:

Segmentation ambiguity: more than one possible segmentation, e.g. “发展中国家”
  发展中 (developing) / 国家 (country)
  发展 (development) / 中 (middle) / 国家 (country)
  发展 (development) / 中国 (China) / 家 (family)

Different words have similar meanings
  接头 (connector, plug) ↔ 插头 (plug) ↔ 插座 (socket)

New words can be formed quite freely
  接 (reception) 桶 (bucket): not a common word, but can be used
  网 (network) 店 (store): more and more used…
  的 (of, taxi) 车 (car): taxi car (?), car of (someone)…
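
The segmentation ambiguity on this slide can be reproduced with a short sketch. It enumerates every way of splitting a string into entries of a lexicon; the lexicon below is a toy set chosen for illustration (an assumption, not a real dictionary), and `segmentations` is a hypothetical helper name.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon (assumption for illustration only).
LEXICON = {"发展", "发展中", "中", "中国", "国家", "家"}

for seg in segmentations("发展中国家", LEXICON):
    print(" / ".join(seg))
```

With this toy lexicon the sketch yields the three readings listed on the slide; a real segmenter has to pick one of them using rules or statistics.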

IRF 6

Alternative: n-grams

Usually unigrams and bigrams
  As effective as using word segmentation
  Accounts for some flexibility
However:
  Noise: non-meaningful combinations
  Wrong combinations

非酿造型啤酒 (non-brewed beer)
  Words: 非 / 酿造 / 型 / 啤酒
  Bigrams: 非酿 / 酿造 / 造型 / 型啤 / 啤酒
  Wrong combination: 造型 (style, appearance, …)
  Non-meaningful: 非酿, 型啤
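
As a minimal sketch of the n-gram alternative, the function below produces overlapping character n-grams. The helper names `char_ngrams` and `index_terms` are hypothetical; real systems would also handle punctuation and mixed scripts, which this sketch ignores.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_terms(text):
    """Unigrams followed by bigrams, as one combined list of index terms."""
    return char_ngrams(text, 1) + char_ngrams(text, 2)


print(char_ngrams("非酿造型啤酒", 2))
```

The bigram list includes 造型, which is a real word with an unrelated meaning: this is the kind of noise the slide warns about.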

IRF 7

Chinese Mono-lingual IR

Possible approach: combining words and n-grams

前年收入有所下降
  Word: 前年 / 收入 / 有所 / 下降 or: 前 / 年收入 / 有所 / 下降
  Unigram: 前 / 年 / 收 / 入 / 有 / 所 / 下 / 降
  Bigram: 前年 / 年收 / 收入 / 入有 / 有所 / 所下 / 下降

Score function in language modeling similar to other languages

Previous results: Word ~ bigram > unigram

IRF 8

Our recent tests

Chinese Monolingual IR (Query: Title)
(W = words, B = bigrams, U = unigrams)

Collection  W      B      U      WU     BU     0.3W+0.7U  0.3B+0.7U  W+B+U
TREC5       .2585  .2698  .3012  .3298  .3074  .3123      .3262      .3273
TREC6       .3861  .3628  .3580  .4220  .3897  .4090      .3880      .4068
NTCIR3      .2609  .2492  .2496  .2606  .2820  .2754      .2840      .2862
NTCIR4      .1996  .2164  .2371  .2254  .2350  .2431      .2429      .2387
NTCIR5      .2974  .3151  .3390  .3118  .3246  .3452      .3508      .3470
Average     .2805  .2827  .2970  .3099  .3077  .3170      .3184      .3212
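
Column labels such as 0.3W+0.7U suggest a weighted combination of two retrieval scores. The sketch below shows one plausible reading, a linear interpolation of per-document scores from two indexes. It assumes both score sets are already normalized to a comparable range and that a document missing from one index scores 0.0 there; neither assumption comes from the slides, and the document ids and scores are invented.

```python
def interpolate(scores_a, scores_b, weight_a=0.3):
    """Combine two {doc_id: score} maps as weight_a*a + (1 - weight_a)*b."""
    docs = set(scores_a) | set(scores_b)
    return {
        d: weight_a * scores_a.get(d, 0.0) + (1 - weight_a) * scores_b.get(d, 0.0)
        for d in docs
    }


def rank(scores):
    """Document ids sorted by descending score (ties broken by id)."""
    return sorted(scores, key=lambda d: (-scores[d], d))


# Invented scores for illustration only.
word_scores = {"d1": 0.9, "d2": 0.1}
unigram_scores = {"d1": 0.2, "d2": 0.8, "d3": 0.5}

combined = interpolate(word_scores, unigram_scores, weight_a=0.3)
print(rank(combined))
```

With log-probability scores the 0.0 default would not be appropriate; a smoothed background score would be needed instead.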

IRF 9

Why is this useful?

NTCIR 5 Topic 18: 烟草商诉讼赔偿 (tobacco company, suit, compensation)
  Word: 烟草商 (tobacco company) / 诉讼 (suit) / 赔偿 (compensation)
  Unigram (0.7659) > Word (0.1625)
  The relevant documents include the words 烟草, 公司, 业者, 香烟, 烟商, but these cannot match “烟草商”.

NTCIR 5 Topic 24: 经济舱综合症候群航班 (economy class, syndrome, flight)
  Word: 经济 (economy) / 综合症 (syndrome) / 候 (wait) / 航班 (flight)
  Ubigram (.7607) > Word (0.0002)
  “..综合症候..” is segmented into “../ 综合症 / 候 /..”, so it cannot match “症候” (syndrome).

The combination of words with unigrams or bigrams helps

IRF 10

Also works for Korean and Japanese?

Mean Average Precision (MAP)

Run       U             B             W             BU            WU            0.3B+0.7U
          Rigid  Relax  Rigid  Relax  Rigid  Relax  Rigid  Relax  Rigid  Relax  Rigid  Relax
C-C-T-N4  .1929  .2370  .1670  .2065  .1679  .2131  .1928  .2363  .1817  .2269  .1979  .2455
C-C-T-N5  .3302  .3589  .2713  .3300  .2676  .3315  .2974  .3554  .3017  .3537  .3300  .3766
J-J-T-N4  .2377  .2899  .2768  .3670  −      −      .2807  .3722  −      −      .2873  .3664
J-J-T-N5  .2376  .2730  .2471  .3273  −      −      .2705  .3458  −      −      .2900  .3495
K-K-T-N4  .2004  .2147  .3873  .4195  −      −      .4084  .4396  −      −      .3608  .3889
K-K-T-N5  .2603  .2777  .3699  .3996  −      −      .3865  .4178  −      −      .3800  .4001

IRF 11

3. CLIR: query translation

Machine translation: rules + dictionaries
Statistical translation model:
  Parallel texts
  Automatically extract possible translations
Comparison:
  A statistical TM does not produce human-readable translations
  But it can include related words

Usually, word-based translation

IRF 12

Our recent tests: also translate into n-grams

English word → Chinese word, Chinese unigram, Chinese bigram, or bigram & unigram

Parallel sentence pair: “history and civilization” || “历史文明” …

Word-to-bigram: history / and / civilization || 历史 / 史文 / 文明 …
  GIZA++ training gives TM (word-to-bigram): p(历史|history), p(史文|history), p(文明|history)

Word-to-unigram: history / and / civilization || 历 / 史 / 文 / 明
  GIZA++ training gives TM (word-to-unigram): p(历|history), p(史|history), p(文|history)

… …
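
To make the word-to-n-gram idea concrete, here is a minimal sketch that rewrites the Chinese side of each sentence pair as character bigrams and then estimates p(bigram | English word) with a bare-bones IBM Model 1 EM loop. This is a simplified stand-in for GIZA++, not its actual behaviour: there is no NULL word and no higher-order alignment model, and the three training pairs are invented for illustration.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_ibm1(pairs, iterations=10):
    """pairs: list of (english_tokens, chinese_units).

    Returns t where t[(c, e)] estimates p(c | e).
    """
    zh_vocab = {c for _, zh in pairs for c in zh}
    t = defaultdict(lambda: 1.0 / len(zh_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for en, zh in pairs:
            for c in zh:
                norm = sum(t[(c, e)] for e in en)
                for e in en:
                    frac = t[(c, e)] / norm  # expected alignment count
                    count[(c, e)] += frac
                    total[e] += frac
        # Re-estimate p(c | e) from the expected counts.
        t = defaultdict(float, {ce: count[ce] / total[ce[1]] for ce in count})
    return t


# Invented toy corpus; the Chinese side is converted to bigrams.
PAIRS = [
    ("history and civilization".split(), char_ngrams("历史文明", 2)),
    ("history".split(), char_ngrams("历史", 2)),
    ("civilization".split(), char_ngrams("文明", 2)),
]

t = train_ibm1(PAIRS)
print(round(t[("历史", "history")], 3), round(t[("史文", "history")], 3))
```

Replacing `char_ngrams(..., 2)` with `char_ngrams(..., 1)` gives the word-to-unigram variant on the same data.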

IRF 13

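Combining different translations

The English query $Q$ is translated into three Chinese queries, each matched against the corresponding index of the Chinese documents $D$:

$Q_U:\ \sum_j t(u_i \mid e_j)\, P(e_j \mid Q)$, matched against $D_U$ (unigrams)

$Q_B:\ \sum_j t(b_i \mid e_j)\, P(e_j \mid Q)$, matched against $D_B$ (bigrams)

$Q_W:\ \sum_j t(w_i \mid e_j)\, P(e_j \mid Q)$, matched against $D_W$ (words)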
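
The query translation on this slide, read as a sum over English query words e_j of t(c_i | e_j) P(e_j | Q), can be sketched as below. The function name `translate_query`, the table layout and all probabilities are assumptions made for illustration; they are not values from the experiments.

```python
from collections import defaultdict


def translate_query(query_model, t_table):
    """Weight each Chinese unit c by sum_j t(c | e_j) * P(e_j | Q).

    query_model: {english_word: P(e | Q)}
    t_table:     {english_word: {chinese_unit: t(c | e)}}
    """
    translated = defaultdict(float)
    for e, p_e in query_model.items():
        for c, t_ce in t_table.get(e, {}).items():
            translated[c] += t_ce * p_e
    return dict(translated)


# Invented probabilities for illustration only.
query_model = {"history": 0.5, "civilization": 0.5}
t_bigram = {
    "history": {"历史": 0.8, "史文": 0.2},
    "civilization": {"文明": 0.9, "史文": 0.1},
}

q_b = translate_query(query_model, t_bigram)
print(sorted(q_b.items(), key=lambda kv: -kv[1]))
```

Running the same function with a word-to-unigram or word-to-word table would give the unigram and word versions of the translated query.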

IRF 14

Bilingual linguistic resources for CLIR

An English-Chinese parallel corpus mined from the Web
  About 281,000 parallel sentence pairs
LDC English-Chinese bilingual dictionaries
  42,000 entries
Translation model
  Combination of the 2 translation models

IRF 15

CLIR results

English-to-Chinese CLIR

Collection  W      B      U      WU     BU     0.3W+0.7U  0.3B+0.7U
TREC5       .1904  .2003  .1922  .2448  .2277  .2158      .2251
TREC6       .2047  .2293  .2602  .2670  .2772  .2672      .2822
NTCIR3      .1288  .1017  .1536  .1628  .1504  .1619      .1495
NTCIR4      .0956  .0953  .1382  .1410  .1308  .1337      .1286
NTCIR5      .1158  .1323  .1762  .1532  .1462  .1682      .1602
Average     .1470  .1518  .1841  .1938  .1865  .1894      .1891

IRF 16

General observations for Chinese IR

Using both words and n-grams for Chinese IR and Chinese query translation

N-grams can account for flexibility in Chinese words

CLIR with Chinese can also benefit from translations into Chinese n-grams

IRF 17

4. OOV problem in Chinese

OOV (Out-Of-Vocabulary) problem
  TREC queries: 63% of named entities are OOV
  Even more on the Web
    Specialized terms (abbreviations)
    New words
  Impossible to collect all terms manually

Solutions
  Parallel texts (translations by n-grams)
  Mono-lingual corpus

IRF 18

Translation of named entities

Statistical transliteration
  Frances Taylor → 弗朗西斯泰勒 / 茀琅希思泰勒 / 弗郎西丝泰勒 …

IRF 19

IRF 20

Candidate extraction

Templates: four templates to extract candidates
  c1c2..cn (En)
  c1c2..cn , En, c’1c’2..c’m
  c1c2..cn: En
  c1c2..cn 是 / 即 En    (是 / 即: "is" / "namely")

Comparing four templates
  Use template 1 in following experiments

Template  Percentage  Precision
1         17.65%      54%
2         68.35%      6.5%
3         9.05%       2.5%
4         4.94%       1%

Table 2: Comparing Precision of the Four Templates
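
The parenthesis template, c1c2..cn (En), can be approximated with a regular expression. The sketch below is only an illustration: the pattern, the `max_len` cut-off and the idea of listing every suffix of the Chinese run as a candidate are assumptions, and the slides do not say how the original system bounded or ranked candidates.

```python
import re

# A run of CJK ideographs followed by an English term in ASCII or
# full-width parentheses.
TEMPLATE_1 = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'-]*?)\s*[)）]"
)


def extract_candidates(text, max_len=6):
    """Return [(english_term, [candidate Chinese strings])] for template 1."""
    results = []
    for match in TEMPLATE_1.finditer(text):
        run, english = match.group(1), match.group(2)
        run = run[-max_len:]  # keep only the characters nearest the parenthesis
        # Every suffix of the run is a possible translation of the English term.
        suffixes = [run[i:] for i in range(len(run))]
        results.append((english, suffixes))
    return results


sample = "他是公司的网管(webmaster)。"
print(extract_candidates(sample))
```

Choosing the right suffix (here 网管) is left to a later step, such as scoring the candidates with a translation or transliteration model.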

IRF 21

Translation model

Train a translation model

Candidate List

IRF 22

Dictionary Mining Results

Processed more than 300GB of Chinese web pages
161,117 translation pairs mined

Translation %  Transliteration %  Accuracy %
53.55          46.45              90.15

Table 4: Accuracy of Mined Dictionary

IRF 23

Coverage of the Dictionary on Query Log Data

9,065 popular English terms from the MSN Chinese search engine

IRF 24

CLIR experiment

IRF 25

Conclusions

In addition to the general approaches, Chinese IR should also consider the characteristics of the language (also for other Asian languages: Japanese and Korean)

Difficulty in translating new (technical) words and proper names
  Exploit parallel/comparable or monolingual texts

Additional problem: make the retrieved document readable
  Full text translation
    Running sentences in patent: relatively easy
    Technical terms: may be difficult with Chinese
  Gisting: translation assistance tool, useful for a user with some knowledge of the document language