What’s New in Statistical Machine Translation

Post on 13-Jan-2016

65 views 0 download

description

What’s New in Statistical Machine Translation. USC/Information Sciences Institute USC/Computer Science Department. Kevin Knight. Machine Translation. 美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。. - PowerPoint PPT Presentation

Transcript of What’s New in Statistical Machine Translation

What’s New in Statistical Machine Translation

Kevin Knight USC/Information Sciences Institute

USC/Computer Science Department

Machine Translation

美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。

The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.

Thousands of Languages Are SpokenMANDARIN 885,000,000SPANISH 332,000,000ENGLISH 322,000,000BENGALI 189,000,000

HINDI 182,000,000PORTUGUESE 170,000,000RUSSIAN 170,000,000JAPANESE 125,000,000GERMAN 98,000,000

WU (China) 77,175,000JAVANESE 75,500,800KOREAN 75,000,000FRENCH 72,000,000VIETNAMESE 67,662,000

TELUGU 66,350,000YUE (China) 66,000,000MARATHI 64,783,000TAMIL 63,075,000

TURKISH 59,000,000URDU 58,000,000MIN NAN (China) 49,000,000JINYU (China) 45,000,000

GUJARATI 44,000,000POLISH 44,000,000ARABIC 42,500,000UKRAINIAN 41,000,000

ITALIAN 37,000,000XIANG (China) 36,015,000MALAYALAM 34,022,000HAKKA (China) 34,000,000

KANNADA 33,663,000ORIYA 31,000,000PANJABI 30,000,000SUNDA 27,000,000

Source: Ethnologue

Recent Progress in Statistical MT

insistent Wednesday may recurred her trips to Libya tomorrow for flying

Cairo 6-4 ( AFP ) - An official announced today in the Egyptian lines company for flying Tuesday is a company "insistent for flying" may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.

Egyptair Has Tomorrow to Resume Its Flights to Libya

Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

20022002 20032003

newsbroadcast

foreignlanguagespeechrecognition

Englishtranslation

searchablearchive

20052005

Warren Weaver (1947)

ingcmpnqsnwf cv fpn owoktvcv

hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ...

Warren Weaver (1947)

e e e e ingcmpnqsnwf cv fpn owoktvcv e e ehu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Warren Weaver (1947)

e e e the ingcmpnqsnwf cv fpn owoktvcv e e ehu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Warren Weaver (1947)

e he e the ingcmpnqsnwf cv fpn owoktvcv e e e thu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Warren Weaver (1947)

e he e of the ingcmpnqsnwf cv fpn owoktvcv e e e thu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Warren Weaver (1947)

e he e of the fofingcmpnqsnwf cv fpn owoktvcv e f o e o oe thu ihgzsnwfv rqcffnw cw owgcnwf efkowazoanv ...

Warren Weaver (1947)

e he e of theingcmpnqsnwf cv fpn owoktvcv e e e thu ihgzsnwfv rqcffnw cw owgcnwf ekowazoanv ...

Warren Weaver (1947)

e he e is the sisingcmpnqsnwf cv fpn owoktvcv e s i e i ie thu ihgzsnwfv rqcffnw cw owgcnwf eskowazoanv ...

Warren Weaver (1947)

decipherment is the analysisingcmpnqsnwf cv fpn owoktvcvof documents written in ancienthu ihgzsnwfv rqcffnw cw owgcnwflanguages ...kowazoanv ...

Warren Weaver (1947)

The non-Turkish guy next to me is even deciphering Turkish! All he needs is a

statistical table of letter-pair frequencies in Turkish …

Collected mechanically from aTurkish body of text, or corpus

Can this becomputerized?

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

- Warren Weaver, March 1947

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

- Warren Weaver, March 1947

“... as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague ... to make any quasi-mechanical translation scheme very hopeful.”

- Norbert Wiener, April 1947

Spanish/English corpus

1a. Garcia and associates .1b. Garcia y asociados .

7a. the clients and the associates are enemies .7b. los clients y los asociados son enemigos .

2a. Carlos Garcia has three associates .2b. Carlos Garcia tiene tres asociados .

8a. the company has three groups .8b. la empresa tiene tres grupos .

3a. his associates are not strong .3b. sus asociados no son fuertes .

9a. its groups are in Europe .9b. sus grupos estan en Europa .

4a. Garcia has a company also .4b. Garcia tambien tiene una empresa .

10a. the modern groups sell strong pharmaceuticals .10b. los grupos modernos venden medicinas fuertes .

5a. its clients are angry .5b. sus clientes estan enfadados .

11a. the groups do not sell zenzanine .11b. los grupos no venden zanzanina .

6a. the associates are also angry .6b. los asociados tambien estan enfadados .

12a. the small groups are not modern .12b. los grupos pequenos no son modernos . 

Spanish/English corpus

1a. Garcia and associates .1b. Garcia y asociados .

7a. the clients and the associates are enemies .7b. los clients y los asociados son enemigos .

2a. Carlos Garcia has three associates .2b. Carlos Garcia tiene tres asociados .

8a. the company has three groups .8b. la empresa tiene tres grupos .

3a. his associates are not strong .3b. sus asociados no son fuertes .

9a. its groups are in Europe .9b. sus grupos estan en Europa .

4a. Garcia has a company also .4b. Garcia tambien tiene una empresa .

10a. the modern groups sell strong pharmaceuticals .10b. los grupos modernos venden medicinas fuertes .

5a. its clients are angry .5b. sus clientes estan enfadados .

11a. the groups do not sell zenzanine .11b. los grupos no venden zanzanina .

6a. the associates are also angry .6b. los asociados tambien estan enfadados .

12a. the small groups are not modern .12b. los grupos pequenos no son modernos . 

Translate: Clients do not sell pharmaceuticals in Europe.

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

???

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

???

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

process ofelimination

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

cognate?

Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok .

1b. at-voon bichat dat .

7a. lalok farok ororok lalok sprok izok enemok .

7b. wat jjat bichat wat dat vat eneat .

2a. ok-drubel ok-voon anok plok sprok .

2b. at-drubel at-voon pippat rrat dat .

8a. lalok brok anok plok nok .

8b. iat lat pippat rrat nnat .

3a. erok sprok izok hihok ghirok .

3b. totat dat arrat vat hilat .

9a. wiwok nok izok kantok ok-yurp .

9b. totat nnat quat oloat at-yurp .4a. ok-voon anok drok brok jok .

4b. at-voon krat pippat sat lat .

10a. lalok mok nok yorok ghirok clok .

10b. wat nnat gat mat bat hilat .5a. wiwok farok izok stok .

5b. totat jjat quat cat .

11a. lalok nok crrrok hihok yorok zanzanok .

11b. wat nnat arrat mat zanzanat .6a. lalok sprok izok jok stok .

6b. wat dat krat quat cat .

12a. lalok rarok nok izok hihok mok .

12b. wat nnat forat arrat vat gat .

zerofertility

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

- Warren Weaver, March 1947

The required statistical tables have millions of entries…?

Too much for the computers of Weaver’s day.

Not enough RAM!

IBM Candide Project (1988-1994)

• How to get quantities of human translation in computer readable form?– parallel corpus

Canadianbureaucrat

IBM’s John Cocke,inventor of CKY parsing & RISC processors

IBM Candide Project (1988-1994)

• How to get quantities of human translation in computer readable form?– parallel corpus

Canadianbureaucrat

IBM’s John Cocke,inventor of CKY parsing & RISC processors

IBM Candide Project (1988-1994)

• How to get quantities of human translation in computer readable form?– parallel corpus

IBM’s John Cocke,inventor of CKY parsing & RISC processors

IBM Candide Project[Brown et al 93]

French BrokenEnglish

English

French/EnglishBilingual Text

EnglishText

Statistical Analysis Statistical Analysis

J’ ai si faim

What hunger have I,Hungry I am so,I am so hungry,Have me that hunger …

I am so hungry

Mathematical Formulation

J’ ai si faim I am so hungry

TranslationModel P(f | e)

LanguageModel P(e)

Decoding algorithm

argmaxe P(e) · P(f | e)

Given source sentence f:

argmaxe P(e | f) =

argmaxe P(f | e) · P(e) / P(f) = by Bayes Rule

argmaxe P(f | e) · P(e) P(f) same for all e

French BrokenEnglish

English

Language Modeling

Goal of a language model for MT:

He is on the soccer fieldHe is in the soccer field

Is table the on cup theThe cup is on the table

American shrineAmerican company

Need to make thesedecisions, becausetranslation model maynot have a lot of context information!

The Classic Language ModelWord Bigrams

Process model of English: Generate each word based only on the previous word.

P(I saw water on the table) =

P(I | START) ·P(saw | I) ·P(water | saw) ·P(on | water) ·P(the | on) ·P(table | the) ·P(END | table)

Probabilities can be tabulatedfrom an online English corpus …just like Weaver’s Turkish case.

Trigram Language Model

to the saidroyal purchase plan trustcopart operations of its its is

international expand banking

[Soricut & Marcu, 05]

Trigram Language Model

to the saidroyal purchase plan trustcopart operations of its its is

international expand banking

the banking trustco is said to expand its purchase part of its royal international plan operations

[Soricut & Marcu, 05]

Trigram Language Model

to the saidroyal purchase plan trustcopart operations of its its is

international expand banking royal trustco said the purchase is part of its plan to expand its international banking operations

the banking trustco is said to expand its purchase part of its royal international plan operations

[Soricut & Marcu, 05]

N-grams have a lot of semantics in them!

Trigram Language Model

to the saidroyal purchase plan trustcopart operations of its its is

international expand banking royal trustco said the purchase is part of its plan to expand its international banking operations

[Soricut & Marcu, 05]

the banking trustco is said to expand its purchase part of its royal international plan operations

with the stressed relationship part

own longstanding its its for chinese boeing , ,

for its part, stressed the longstanding relationship with its own, chinese boeing

boeing, for its part, stressed its own longstanding relationship with the chinese

Translation Model?

Mary did not slap the green witch

Maria no dió una bofetada a la bruja verde

Source-language morphological analysis

Source parse tree

Semantic representation

Generate target structure

Process model of translation:

Translation Model?

Mary did not slap the green witch

Maria no dió una bofetada a la bruja verde

Source-language morphological analysis

Source parse tree

Semantic representation

Generate target structure

What are allthe possiblemoves andwhat probabilitytables controlthose moves?

Process model of translation:

The Classic Translation ModelWord Substitution/Permutation [Brown et al., 1993]

Mary did not slap the green witch

Mary not slap slap slap the green witch

Maria no dió una bofetada a la bruja verde

Mary not slap slap slap NULL the green witch

Maria no dió una bofetada a la verde bruja

Trainable

Process model of translation:

n(3|slap) 50k entries

d(j|i) 2500 entries

P-Null 1 entry

t(la|the) 25m entries

The Classic Translation ModelWord Substitution/Permutation [Brown et al., 1993]

Mary did not slap the green witch

Maria no dió una bofetada a la bruja verde

Still trainable!

Process model of translation:

n(3|slap) 50k entries

d(j|i) 2500 entries

P-Null 1 entry

t(la|the) 25m entries

?

Classic Formula for P(f | e)

P(f | e) =

m – Φ0 lΣ ( ) · P-Null m – 2Φ0 · (1-P-Null) Φ0 · Φi! · (1 /

Φ0!) ·a Φ0 i=0

l m m n(Φi | ei) · t(fj | eaj) · d(j | aj, l, m) i=1 j=1 j:a j <> 0

fertility word translation re-ordering

NULL stuff

Set parameter values so formula assigns the highest possible probability to observed human translations. This is a 25m-dimensional search space.

sum overalignment

possibilities

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

All P(french-word | english-word) equally likely

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“la” and “the” observed to co-occur frequently,so P(la | the) is increased.

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

“maison” co-occurs with both “the” and “house”, butP(maison | house) can be raised without limit, to 1.0,

while P(maison | the) is limited because of “la”

(pigeonhole principle)

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

settling down after another iteration

Unsupervised EM Training

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

Inherent hidden structure revealed by EM training!

• “A Statistical MT Tutorial Workbook” (Knight, 1999). Promises free beer.

• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)

• Software: GIZA++

e f P(f | e)

national

nationale 0.47

national 0.42

nationaux 0.05

nationales 0.03

the

le 0.50

la 0.21

les 0.16

l’ 0.09

ce 0.02

cette 0.01

farmers

agriculteurs 0.44

les 0.42

cultivateurs 0.05

producteurs 0.02

Translation Model

Sample Translation Probabilities

[Brown et al 93]

e f P(f | e)

national

nationale 0.47

national 0.42

nationaux 0.05

nationales 0.03

the

le 0.50

la 0.21

les 0.16

l’ 0.09

ce 0.02

cette 0.01

farmers

agriculteurs 0.44

les 0.42

cultivateurs 0.05

producteurs 0.02

new Frenchsentence f

Translation Model

potentialtranslation e

P(f | e)

e f P(f | e)

national

nationale 0.47

national 0.42

nationaux 0.05

nationales 0.03

the

le 0.50

la 0.21

les 0.16

l’ 0.09

ce 0.02

cette 0.01

farmers

agriculteurs 0.44

les 0.42

cultivateurs 0.05

producteurs 0.02

new Frenchsentence f

Translation Model

Language Model

w1 w2 P(w2 | w1)

of

the 0.13

a 0.09

another 0.01

some 0.01

hong

kong 0.98

said 0.01

stated 0.01

potentialtranslation e

P(f | e) P(e)

e f P(f | e)

national

nationale 0.47

national 0.42

nationaux 0.05

nationales 0.03

the

le 0.50

la 0.21

les 0.16

l’ 0.09

ce 0.02

cette 0.01

farmers

agriculteurs 0.44

les 0.42

cultivateurs 0.05

producteurs 0.02

new Frenchsentence f

P(f | e) · P(e) score for e

Translation Model

Language Model

w1 w2 P(w2 | w1)

of

the 0.13

a 0.09

another 0.01

some 0.01

hong

kong 0.98

said 0.01

stated 0.01

potentialtranslation e

P(f | e) P(e)

Search for Best Translation

voulez – vous vous taire !

Search for Best Translation

voulez – vous vous taire !

you – you you quiet !

Search for Best Translation

voulez – vous vous taire !

you – you quiet !

Search for Best Translation

voulez – vous vous taire !

quiet you – you you !

Search for Best Translation

voulez – vous vous taire !

shut you – you you !

Search for Best Translation

voulez – vous vous taire !

you shut !

Search for Best Translation

voulez – vous vous taire !

you shut up !

Classic Decoding Algorithm

Given f, find the English string e that maximizes P(e) · P(f | e)

NP-Complete [Knight 99].

Brown et al 93:“In this paper, we focus on the

translation modeling problem. We hope to deal with the [decoding] problem in a later paper.”

Beam Search Decoding[Brown et al US Patent #5,477,451]

1st Englishword

2nd Englishword

3rd Englishword

4th Englishword

start end

Each partial translation hypothesis contains: - Last English word chosen + source words covered by it - Next-to-last English word chosen - Entire coverage vector (so far) of source sentence - Language model and translation model scores (so far)

all sourcewords

covered

[Jelinek 69; Och, Ueffing, and Ney, 01]

1st Englishword

2nd Englishword

3rd Englishword

4th Englishword

start end

Each partial translation hypothesis contains: - Last English word chosen + source words covered by it - Next-to-last English word chosen - Entire coverage vector (so far) of source sentence - Language model and translation model scores (so far)

all sourcewords

covered

best predecessorlink

[Jelinek 69; Och, Ueffing, and Ney, 01]

Beam Search Decoding[Brown et al US Patent #5,477,451]

Classic Results• nous avons signé le protocole . (Foreign Original)• we did sign the memorandum of agreement . (Human Translation)• we have signed the protocol . (MT)

• où était le plan solide ? (Foreign Original)• but where was the solid plan ? (Human Translation)• where was the economic base ? (MT)

the Ministry of Foreign Trade and Economic Cooperation, including foreigndirect investment 40 billion US dollars today provide data includethat year to November china actually using foreign 46.959 billion US dollars and

very slow = one page per day

Okay!

I know, so far, this talk should be called …

What’s Old in Statistical

Machine Translation!!

Further Developments

• Follow-on projects– Hong Kong– Aachen– Behavior Design Corporation

• JHU Summer Workshop 1999– Build & distribute statistical MT tools– Create standard training & testing data– Disseminate tutorial material– “MT in a Day”– Ask new questions

How Much Data Do We Need?

Amount of bilingual training data

Quality of automatically trainedmachine translationsystem

Advances in Statistical MT2000-2004

Ready-to-Use Online Bilingual Data

020

406080

100120140

160180

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Millions of words(English side)

Ready-to-Use Online Bilingual Data

020

406080

100120140

160180

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Millions of words(English side)

+ European parliament data [Koehn 05]

Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

BLEU Evaluation Metric(Papineni et al 02)

• N-gram precision (score is between 0 & 1)

What percentage of machine n-grams can be found in the reference translation?

Gross measure over 1000 test sentences.Not allowed to use same portion of reference

translation twice (can’t cheat by typing out “the the the the the”)

Brevity penalty: can’t just type out single word “the” (and get precision 1.0)

BLEU in Action

枪手被警方击毙。 (Foreign Original)

the gunman was shot to death by the police . (Reference Translation)

the gunman was police kill . #1wounded police jaya of #2the gunman was shot dead by the police . #3the gunman arrested by police kill . #4the gunmen were killed . #5the gunman was shot to death by the police . #6gunmen were killed by police ?SUB>0 ?SUB>0 #7al by the police . #8the ringer is killed by the police . #9police killed the gunman . #10

BLEU in Action

枪手被警方击毙。 (Foreign Original)

the gunman was shot to death by the police . (Reference Translation)

the gunman was police kill . #1wounded police jaya of #2the gunman was shot dead by the police . #3the gunman arrested by police kill . #4the gunmen were killed . #5the gunman was shot to death by the police . #6gunmen were killed by police ?SUB>0 ?SUB>0 #7al by the police . #8the ringer is killed by the police . #9police killed the gunman . #10

green = 4-gram match (good!)red = word not matched (bad!)

BLEU Tends to Predict Human Judgments

R2 = 88.0%

R2 = 90.2%

-2.5

-2.0

-1.5

-1.0

-0.5

0.0

0.5

1.0

1.5

2.0

2.5

-2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5

Human Judgments

BL

EU

Sc

ore

Adequacy

Fluency

Linear(Adequacy)Linear(Fluency)

slide from G. Doddington (NIST)

Experiment-Driven Progress

Mar1

Apr1

15

20

25

30

35

May1

BLEU

2005

ISI Syntax-Based MTChinese/EnglishNIST 2002 Test Set

Evaluate new MT research ideas every day!(and be alerted about bugs…)

Draw Learning Curves

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

10k 20k 40k 80k 160k 320k

Swedish/EnglishFrench/EnglishGerman/EnglishFinnish/English

# of sentence pairs used in training

BLEUscore

Experiments byPhilipp Koehn

Flaws of Word-Based MT

• Can’t translate multiple English words to one French word

• Can’t translate phrases– “real estate”, “note that”, “interest in”

• Isn’t sensitive to syntax– Adjectives/nouns should swap order– Verb comes at the beginning in Arabic

• Doesn’t understand the meaning (?)

The MT Triangle

SOURCE TARGET

words words

syntax syntax

logical form

interlingua

logical form

The MT Swimming Pool

syntax syntax

interlingua

words words

logical form logical form

SOURCE TARGET

Commercial Rule-Based Systems

words words

syntax syntax

logical form

interlingua

logical form

Knight et al 95 - meaning-based translation - composition rules

Language Model

SOURCE TARGET

words words

syntax syntax

logical form

interlingua

logical form

Wu 97, Alshawi 98 - inducing syntactic structure as a by-product of aligning words in bilingual text

SOURCE TARGET

Language Modelwords words

syntax syntax

logical form

interlingua

logical form

Yamada/Knight (01,02) - tree/string model - used existing target language parser

SOURCE TARGET

Language Modelwords words

syntax syntax

logical form

interlingua

logical form

Well, these all seem like good ideas.

Which one had the most dramatic effect on MT quality?

None of them!

Phrases

SOURCE TARGET

phrases phrases

How do you translate“real estate” into French?

real estate real number dance number dance card memory card memory stick …

words words

syntax syntax

logical form

interlingua

logical form

Phrase-Based Statistical MT

• Foreign input segmented into phrases– “phrase” just means “word sequence”

• Each phrase is probabilistically translated into English– P(to the conference | zur Konferenz)– P(into the meeting | zur Konferenz)

• Phrases are probabilistically re-ordered

See [Koehn et al, 2003] for an overview.

Morgen fliege ich nach Kanada zur Konferenz

Tomorrow I will fly to the conference In Canada

How to Learn the Phrase Translation Table?

• One method: “alignment templates” [Och et al 99]

• Start with word alignment

• Collect all phrase pairs that are consistent with the word alignment

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

(a la, the) (dió una bofetada a, slap)

(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)

(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)

(a la bruja verde, the green witch) …

Mary

did

not

slap

the

green

witch

Maria no dió una bofetada a la bruja verde

Word Alignment Induced Phrases

(Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green)

(a la, the) (dió una bofetada a, slap)

(Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the)

(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)

(a la bruja verde, the green witch) …

(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)

Phrase Pair Probabilities

• A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.

• No EM training

• Just relative frequency:

count(f-f-f, e-e-e) P(f-f-f | e-e-e) = ----------------------- count(e-e-e)

Phrase-Based MT

• This is currently the best way to do Statistical MT!

• What took so long to move from words to phrases?– Missing RAM

• 25m parameters billions of parameters• Trick idea: build test-corpus-specific phrase table (takes 5 hours!)• Now solved in commercial deployments

– Missing computing power– Many competing ideas to shake out

• Koehn 03 summarizes several variations– Empirical effectiveness even better than intuition would predict

• This is not building a ladder to the moon!– If you can’t translate “real estate” into French, you are sunk

Advanced Training Methods

argmax P(e | f) = e

argmax P(e) x P(f | e) / P(f) = e

argmax P(e) x P(f | e) e

Advanced Training Methods

argmax P(e | f) = e

argmax P(e) x P(f | e) / P(f) = e

argmax P(e)2.4 x P(f | e) … works better! e

Advanced Training Methods

argmax P(e | f) = e

argmax P(e) x P(f | e) / P(f) = e

argmax P(e)2.4 x P(f | e) x length(e)1.1

eRewards longer hypotheses, since these are unfairly punished by P(e)

Advanced Training Methods

argmax P(e)2.4 x P(f | e) x length(e)1.1 x FEAT 3.7 …

e

Lots of features vote on every potential translation.

Exponential model.

Problem: How to set the exponent weights?

IDEA 1: maximize probability of the dataIDEA 2: maximize BLEU score of MT system

20.64% BLEU

WTM fixed at 1.0

17.96% BLEU

plot by Emil Ettelaie

Maximum BLEU Training

• Novel algorithm developed by [Och 03]

• Opened gates to “feature hacking”– Word-based feature to smooth phrase pair

counts (“Model1 Inverse”)– Phrase-specific propensities to re-order

• Currently limited to ~25 features

Advances in Statistical MT2005

Google’s Language Model

• Previously, largest language model was trained on 1b words of English

• 20b words of news significant impact on news translation

• 200b words of web helpful

Maryland’s Hiero system [Chiang 05]

• Previously:– ne mange pas does not eat

• New phrase pairs with variables and reordering– ne X pas does not X

– le X1 du X2 X2 's X1

• Nesting– “does not X” itself becomes an X

• CKY decoderJohnCocke

ISI’s Syntax-Based MT System

• First strong showing for an SMT system that knows what nouns and verbs are!

• Why syntax?“Frequent high-tech exports are bright spots for

foreign trade growth of Guangdong has made important contributions.”

– Need much more grammatical output– Need accurate control over re-ordering– Need accurate insertion of function words

The gunman killed by police .

String Output

. 击毙警方被枪手

The gunman killed by police .

DT NN VBD IN NN

NPB PP

NP-C VP

S

Tree Output

. 击毙警方被枪手

Gunman by police shot .

NN IN NN VBD

NPB PP

NP-C VP

S

Tree Output

. 击毙警方被枪手

The gunman was killed by police .

DT NN AUX VBN IN NN

NPB PP

NP-C VP

S

Tree Output

. 击毙警方被枪手

Sample Rules Learned from Data

"说 " "," x0 0.57

"说 " x0 0.09

"他 " "说 " "," x0 0.02

"指出 " "," x0 0.02

x0 0.02

VP

VB SBAR

IN x0:S

that

said

NP

x0:NP PP

IN x1:NP

from

x1 x0 0.27

"来自 " x1 x0 0.15

x1 "的 " x0 0.06

"从 " x1 x0 0.06

"来自 " x1 "的 " x0 0.06

Sample Rules Learned from Data

S

x0:NP VP

x1:VB x2:NP

x0 x1 x2 0.82

x0 x1 "," x2 0.02

x0 x1 x2 0.54 x1 x0 x2 0.44

S

x0:NP VP

x1:VB x2:NP

(Chinese/English)

(Arabic/ English)

subject-verb inversion

Format is Expressive

S

x0:NP VP

x1:VB x2:NP2

x1, x0, x2

S

PRO VP

VB x0:NPthere

are

hay, x0

NP

x0:NP PP

of

P x1:NP

x1, , x0

Multilevel Re-Ordering

Non-constituent Phrases

Lexicalized Re-Ordering

VP

VBZ VBG

is

está, cantando

Phrasal Translation

singing

VP

VB x0:NP PRT

put

poner, x0

Non-contiguous Phrases

on

NPB

DT x0:NNS

the

x0

Context-SensitiveWord Insertion

[Knight & Graehl, 2005]

Story Gets More Interesting…

Applications

MT

Automata Theory

TreeTransducers(Rounds 70)

Story Gets More Interesting…

Applications

MT

Automata Theory

TreeTransducers(Rounds 70)

Linguistic Theory

TransformationalGrammar

(Chomsky 57)

Story Gets More Interesting…

Applications

MT (05)

Automata Theory

TreeTransducers(Rounds 70)

Compression (01)

QA (03)

Generation (00)

Linguistic Theory

TransformationalGrammar

(Chomsky 57)

Story Gets More Interesting…

Applications

Algorithms

Linguistic Theory

Automata Theory

TransformationalGrammar

(Chomsky 57)

TreeTransducers(Rounds 70)

Efficient TransducerAlgorithms

Generic TreeToolkits

MT (05)

Compression (01)

QA (03)

Generation (00)

Summary

• Making good progress

• Algorithms + Data + Evaluation + Computers

• Interdisciplinary work– Natural language processing– Machine learning– Linguistics– Automata theory

• Hope that more people will join!

Thank you

Syntax-Based vs Phrase-Based

Mar1

Apr1

15

20

25

30

35

May1

BLEUphrase-based system

2005

Chinese/EnglishNIST 2002 Test Set

Future PhD Theses?

“Syntax-based Language Models for Improving Statistical MT”

“Discriminative Training of Millions of Features for MT”

“Semantic Representations Induced from Multilingual EU and UN Data”

“What Makes One Language Pair More Difficult to Translate Than Another”

“A State-of-the-Art MT System Based on Syntactic Transformations”

“New Training Methods for High-Quality Word Alignment”

+ many unpredictable ones…

Summary

• Phrase-based models are state-of-the-art– Word alignments– Phrase pair extraction & probabilities– N-gram language models– Beam search decoding– Feature functions & learning weights

• But the output is not English– Fluency must be improved– Better translation of person names, organizations, locations– More automatic acquisition of parallel data, exploitation of

monolingual data across a variety of domains/languages– Need good accuracy across a variety of domains/languages

Available Resources• Bilingual corpora

– 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)– Lots of French/English, Spanish/French/English, LDC– European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI

• (www.isi.edu/~koehn/publications/europarl)– 20m words (sentence-aligned) of English/French, Ulrich Germann, ISI

• (www.isi.edu/natural-language/download/hansard/)

• Sentence alignment– Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)– Xiaoyi Ma, LDC (Champollion)

• Word alignment– GIZA, JHU Workshop ’99 (www.clsp.jhu.edu/ws99/projects/mt/)– GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)– Manually word-aligned test corpus (500 French/English sentence pairs), RWTH

Aachen– Shared task, NAACL-HLT’03 workshop

• Decoding– ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)– ISI Pharoah phrase-based decoder

• Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)

• Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)

Some Papers Referenced on Slides• ACL

– [Och, Tillmann, & Ney, 1999]– [Och & Ney, 2000]– [Germann et al, 2001]– [Yamada & Knight, 2001, 2002]– [Papineni et al, 2002]– [Alshawi et al, 1998]– [Collins, 1997]– [Koehn & Knight, 2003]– [Al-Onaizan & Knight, 2002]– [Och & Ney, 2002]– [Och, 2003]– [Koehn et al, 2003]

• EMNLP– [Marcu & Wong, 2002]– [Fox, 2002]– [Munteanu & Marcu, 2002]

• AI Magazine– [Knight, 1997]

• www.isi.edu/~knight– [MT Tutorial Workbook]

• AMTA– [Soricut et al, 2002]– [Al-Onaizan & Knight, 1998]

• EACL– [Cmejrek et al, 2003]

• Computational Linguistics– [Brown et al, 1993]– [Knight, 1999]– [Wu, 1997]

• AAAI– [Koehn & Knight, 2000]

• IWNLG– [Habash, 2002]

• MT Summit– [Charniak, Knight, Yamada, 2003]

• NAACL– [Koehn, Marcu, Och, 2003]– [Germann, 2003]– [Graehl & Knight, 2004]– [Galley, Hopkins, Knight, Marcu, 2004]

Ready-to-Use Online Bilingual Data

0

20

40

60

80

100

120

140

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Millions of words(English side)

Ready-to-Use Online Bilingual Data

020

406080

100120140

160180

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn).

Millions of words(English side)

+ 1m-20m words formany language pairs

Ready-to-Use Online Bilingual Data

020

406080

100120140

160180

1994

1996

1998

2000

2002

2004

Chinese/English

Arabic/English

French/English

Millions of words(English side)

One Billion?

???

From No Data to Sentence Pairs

• Easy way: Linguistic Data Consortium (LDC)• Really hard way: pay $$$

– Suppose one billion words of parallel data were sufficient– At 20 cents/word, that’s $200 million

• Pretty hard way: Find it, and then earn it!– De-formatting– Remove strange characters– Character code conversion– Document alignment– Sentence alignment– Tokenization (also called Segmentation)

Sentence Alignment

The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.

El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan.

Sentence Alignment

1. The old man is happy.

2. He has fished many times.

3. His wife talks to him.

4. The fish are jumping.

5. The sharks await.

1. El viejo está feliz porque ha pescado muchos veces.

2. Su mujer habla con él.

3. Los tiburones esperan.

Sentence Alignment

1. The old man is happy.

2. He has fished many times.

3. His wife talks to him.

4. The fish are jumping.

5. The sharks await.

1. El viejo está feliz porque ha pescado muchos veces.

2. Su mujer habla con él.

3. Los tiburones esperan.

Sentence Alignment

1. The old man is happy. He has fished many times.

2. His wife talks to him.

3. The sharks await.

1. El viejo está feliz porque ha pescado muchos veces.

2. Su mujer habla con él.

3. Los tiburones esperan.

Note that unaligned sentences are thrown out, andsentences are merged in n-to-m alignments (n, m > 0).

Tokenization (or Segmentation)

• English– Input (some byte stream):

"There," said Bob.– Output (7 “tokens” or “words”): " There , " said Bob .

• Chinese– Input (byte stream):

– Output:

美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件。

美国 关岛国 际机 场 及其 办公 室均接获 一名 自称 沙地 阿拉 伯

富 商拉登 等发 出 的 电子邮件。

Lower-Casing

• English– Input (7 words):

" There , " said Bob .

– Output (7 words):

" there , " said bob .

Thethe“The“the

theSmaller vocabulary size.More robust counting and learning.

Idea of tokenizing and lower-casing:

Recent Progress in Statistical MT

• Why is that?– Better algorithms that learn patterns from data– More data– Faster, cheaper computers with more RAM– Community-wide test sets– Novel automated evaluation methods– Shared software tools

Three Problems for Statistical MT

• Translation model– Given a pair of strings <f,e>, assigns P(f | e) by formula– <f,e> look like translations high P(f | e)– <f,e> don’t look like translations low P(f | e)

• Language model– Given an English string e, assigns P(e) by formula– good English string high P(e)– random word sequence low P(e)

• Decoding algorithm– Given a language model, a translation model, and a new

sentence f … find translation e maximizing P(e) · P(f | e)

Web Language Models

She has a lot of nerve. [20]French input It has a lot of nerve. [3]

?

Used by Google in 2005 to increase performance of their research MT system!

[Soricut, Knight, Marcu, 02]