Text Boundary Analysis Eric Mader Advisory Software Engineer IBM.

Post on 17-Jan-2018

Text Boundary Analysis

Eric Mader, Advisory Software Engineer

IBM

Where do I break lines?

The rain in Spain stays mainly on the plain.

ด่ๅแรงฃนึ๓อัตราลกูจา้งใหมใ่ห๓้๕

您有坦率和誠實的聲譽。

Even in English, this can be hard

You owe me $1,234.56... I think.

Word wrapping vs word selection

Some characters’ behavior is context-dependent.

Word wrapping:

Searching by words:

Analysis by pairs

[Table, built up over several slides: a break decision for each pair of adjacent characters, indexed by the class of the first character and the class of the second. The labels shown are ltr, dgt, sp, pun, nbs and kji, with "X" and "-" entries. The positions of the entries did not survive extraction.]
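The pair-table idea can be sketched in a few lines of Python. This is only an illustration: the class names follow the labels on the slide (with `hyp` as an assumed name for the hyphen class), but the contents of `BREAK_BETWEEN` are a guess at a plausible line-break table, not the table from the talk.

```python
# Hypothetical character classes, named after the labels on the slide.
def char_class(ch):
    if "\u4e00" <= ch <= "\u9fff":
        return "kji"   # CJK ideograph
    if ch == "\u00a0":
        return "nbs"   # non-breaking space (checked before isspace())
    if ch == "-":
        return "hyp"   # assumed name for the hyphen class
    if ch.isalpha():
        return "ltr"
    if ch.isdigit():
        return "dgt"
    if ch.isspace():
        return "sp"
    return "pun"

# Assumed table: (class of first char, class of second char) pairs
# between which a line break is allowed.
BREAK_BETWEEN = {
    ("sp", "ltr"), ("sp", "dgt"), ("sp", "pun"), ("sp", "kji"),
    ("hyp", "ltr"),
    ("kji", "kji"), ("kji", "ltr"), ("ltr", "kji"),
}

def line_break_opportunities(text):
    """Return every offset i where a break is allowed between text[i-1] and text[i]."""
    return [
        i for i in range(1, len(text))
        if (char_class(text[i - 1]), char_class(text[i])) in BREAK_BETWEEN
    ]

print(line_break_opportunities("The rain in Spain"))
```

Each decision looks at exactly two characters, so the whole analysis is one table lookup per position.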

Where pairs break down

A break position can depend on more than two characters:

You owe me $1,234.56... I think.

The digit-period pair is kept together in “4.5” but must be split in “6..”: whether the period belongs to the number depends on the character that follows it.

Where pairs break down

Sentence boundaries require even more lookahead:

He asked, “How tall are you?” I’m about 6 ft. tall. “Wow!”

An example

•If not otherwise mentioned, each character is a “word” unto itself.

•A run of letters constitutes a “word” and is kept together. Certain punctuation marks may appear inside a word, but only if they have a letter on each side.

•A run of digits constitutes a “number” and is kept together. Certain punctuation marks may appear inside a number, but only if they have a digit on each side. In addition, a number may have certain optional prefix and suffix characters.

•If a “word” and a “number” appear in succession with nothing between them, they’re kept together.

The state-machine approach

[State diagram, animated over several slides: a start state with transitions labeled A, ’ ., 0, $ and %. The later slides trace the input $1,234.56... through the diagram.]
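A table-driven scanner for the word rules listed under "An example" might look like the following Python sketch. The state names, the character classes and the exact transitions are assumptions modeled on the diagram labels (A, ’ ., 0, $, %), not a copy of the diagram. The scanner remembers the last accepting position, so trailing punctuation that is not followed by a letter or digit is left out of the token.

```python
def classify(ch):
    """Map a character to an assumed input class, named after the diagram labels."""
    if ch.isalpha():
        return "A"
    if ch.isdigit():
        return "0"
    if ch in "'’.":
        return "'"      # punctuation allowed inside a word or a number
    if ch == ",":
        return ","      # punctuation allowed inside a number only
    if ch in "$%":
        return ch
    return "?"          # everything else is a "word" unto itself

# Hypothetical transition table: state -> {input class -> next state}.
TRANSITIONS = {
    "start":      {"A": "word", "0": "number", "$": "prefix"},
    "word":       {"A": "word", "0": "number", "'": "word_mid"},
    "word_mid":   {"A": "word"},
    "number":     {"0": "number", "A": "word", "'": "number_mid",
                   ",": "number_mid", "%": "suffix"},
    "number_mid": {"0": "number"},
    "prefix":     {"0": "number"},
    "suffix":     {},
}
ACCEPTING = {"word", "number", "prefix", "suffix"}

def next_boundary(text, start):
    """Run the machine from start and return the last accepting position."""
    state = "start"
    last_accept = start + 1          # default: a single character is a token
    pos = start
    while pos < len(text):
        state = TRANSITIONS[state].get(classify(text[pos]))
        if state is None:            # no transition: stop scanning
            break
        pos += 1
        if state in ACCEPTING:
            last_accept = pos
    return last_accept

def tokens(text):
    out, pos = [], 0
    while pos < len(text):
        end = next_boundary(text, pos)
        out.append(text[pos:end])
        pos = end
    return out

print(tokens("You owe me $1,234.56... I think."))
```

On the sample sentence this keeps `$1,234.56` together and splits each period of the ellipsis into its own token, because the `number_mid` state is not accepting.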

Limitations

1992–1996

Automatic table building

let=[:L:];
dgt=[:N:];
mid-word=[[:Pd:]\”\’\.];
mid-num=[\”\’\.\,];
pre-num=[[[:Sc:]\#\.]-[¢]];
post-num=[\%\&¢];
word=({let}+({mid-word}{let}+)*);
number=({dgt}+({mid-num}{dgt}+)*);
{word}?({number}{word})*({number}{post-num}?)?;
{pre-num}({number}{word})*({number}{post-num}?)?;

•All regular-expression rules have equal precedence

•The “winning” rule is decided using a longest-possible-match algorithm (except in certain well-defined cases)

•Our build algorithm parses the regular expressions, builds the state table, and makes sure it’s deterministic in a single pass
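To illustrate equal precedence and longest-match selection, here is a rough Python port of the word rules. The character sets are simplified stand-ins for `[:L:]`, `[:N:]`, `[:Pd:]` and `[:Sc:]`, and Python's `re` module is a backtracking matcher rather than a generated state table, so this only approximates the behavior described above. It assumes these particular patterns are unambiguous enough that a greedy match is also the longest one.

```python
import re

# Assumed stand-ins for the character sets in the rules.
LET = r"[^\W\d_]"          # any letter
DGT = r"\d"
MID_WORD = r"[\-'’.]"
MID_NUM = r"['’.,]"
PRE_NUM = r"[$£€¥#.]"
POST_NUM = r"[%&¢]"

WORD = rf"{LET}+(?:{MID_WORD}{LET}+)*"
NUMBER = rf"{DGT}+(?:{MID_NUM}{DGT}+)*"
TAIL = rf"(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?"

RULES = [
    re.compile(r".", re.DOTALL),            # default: one character is a "word"
    re.compile(rf"(?:{WORD})?{TAIL}"),
    re.compile(rf"{PRE_NUM}{TAIL}"),
]

def next_boundary(text, start, rules=RULES):
    """Try every rule at start; the longest match wins, whatever the rule order."""
    best = start
    for rule in rules:
        m = rule.match(text, start)
        if m:
            best = max(best, m.end())
    return best

def tokens(text, rules=RULES):
    out, pos = [], 0
    while pos < len(text):
        end = next_boundary(text, pos, rules)
        out.append(text[pos:end])
        pos = end
    return out

sample = "You owe me $1,234.56... I think."
print(tokens(sample))
# Reordering the rules does not change the result.
print(tokens(sample, list(reversed(RULES))) == tokens(sample))
```

Trying each rule separately at every position is far slower than a single deterministic table, which is the point of building the table ahead of time.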

Sentence-break rules

.*?{term}[{term}{period}{end}]*{space}*;
.*?{period}[{period}{end}]*{space}*/{start}*{sent-start};
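The two sentence rules can be approximated with Python regular expressions. This is a sketch under stated assumptions: the character sets for `{term}`, `{end}`, `{start}` and `{sent-start}` are guesses, `/` is read as "the break goes here, and what follows is lookahead", and the non-greedy `.*?` is modeled by taking the earliest boundary any rule produces.

```python
import re

# Assumed character sets; the rule file's actual definitions are not shown.
TERM = re.escape("?!")
PERIOD = re.escape(".")
END = re.escape("”’\")]")       # closing punctuation
START = re.escape("“‘\"([")     # opening punctuation
SENT_START = "A-Z"              # characters that can begin a sentence

RULES = [
    # A "?" or "!" always ends the sentence.
    re.compile(rf".*?[{TERM}][{TERM}{PERIOD}{END}]*\s*", re.DOTALL),
    # A period ends the sentence only if a sentence start follows (lookahead).
    re.compile(
        rf".*?[{PERIOD}][{PERIOD}{END}]*\s*(?=[{START}]*[{SENT_START}])",
        re.DOTALL,
    ),
]

def sentences(text):
    out, pos = [], 0
    while pos < len(text):
        ends = []
        for rule in RULES:
            m = rule.match(text, pos)
            if m and m.end() > pos:
                ends.append(m.end())
        end = min(ends) if ends else len(text)   # earliest boundary wins
        out.append(text[pos:end])
        pos = end
    return out

for s in sentences("He asked, “How tall are you?” I’m about 6 ft. tall. “Wow!”"):
    print(repr(s))
```

The period after "ft" is not treated as a boundary because the lookahead finds a lowercase letter, while the period after "tall" is followed by an opening quote and a capital.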

Ignore characters

$ignore=[[:Mn:][:Me:][:Cf:]];
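One way to picture the effect of the ignore set: characters in the `Mn`, `Me` and `Cf` categories are passed over without changing what the iterator thinks it is in the middle of. The Python sketch below applies that idea to a deliberately simple "runs of letters" splitter. The splitter is an assumption made for illustration; only the three category names come from the rule above.

```python
import unicodedata

IGNORE_CATEGORIES = {"Mn", "Me", "Cf"}   # the categories named in $ignore

def is_ignored(ch):
    return unicodedata.category(ch) in IGNORE_CATEGORIES

def words(text):
    """Split into runs of letters and single other characters,
    treating ignored characters as transparent."""
    out, cur, in_letters = [], "", False
    for ch in text:
        if is_ignored(ch) and cur:
            cur += ch                    # stays attached; state is unchanged
            continue
        if ch.isalpha() and in_letters:
            cur += ch
            continue
        if cur:
            out.append(cur)
        cur, in_letters = ch, ch.isalpha()
    if cur:
        out.append(cur)
    return out

# "résumé" written with combining acute accents (U+0301) stays one word.
print(words("re\u0301sume\u0301 ok"))
```

Without the `is_ignored` check, each combining accent would end the run of letters and split the word into pieces.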

Surrogate support

kanji=[\u4e00-\u9fff\udb80-\udb83];
$ignore=[[:Mn:][:Me:][:Cf:]\udc00-\udcff];

Random-access iteration

You owe me $1,234.56... I think.

Random-access iteration

!{sent-start}{start}*{space}*{end}*{period};
![{sent-start}{lc}{digit}]{start}*{space}*{end}*{term};
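A common way to support random access is to back up from the requested offset to a position that is certain to be a boundary, then iterate forward from there. The Python sketch below shows that pattern for word boundaries rather than sentences. The `TOKEN` pattern and the "just after the previous whitespace" safe point are assumptions chosen to keep the example small. They are not the reverse rules shown above.

```python
import re

# Assumed forward rule: a number with optional $ and %, a word, or any one character.
TOKEN = re.compile(
    r"\$?\d+(?:[.,]\d+)*%?|[^\W\d_]+(?:['’][^\W\d_]+)*|.",
    re.DOTALL,
)

def safe_start(text, offset):
    """Back up to a position that is always a boundary for TOKEN:
    just after the previous whitespace character, or the start of the text."""
    pos = min(offset, len(text))
    while pos > 0 and not text[pos - 1].isspace():
        pos -= 1
    return pos

def following(text, offset):
    """Return the first boundary strictly after offset (len(text) at the end)."""
    pos = safe_start(text, offset)
    while pos <= offset and pos < len(text):
        pos = TOKEN.match(text, pos).end()
    return pos

text = "You owe me $1,234.56... I think."
print(following(text, 15))   # offset 15 is inside "$1,234.56"
```

Whitespace only ever matches the single-character alternative, so a position right after it is always a token boundary, which is what makes it a safe restart point here.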

Dictionary-based iteration

We hold these truths to be self-evident: that all men are created equal, that they are endowed by their Creator with certain unalienable rights, that among these are Life, Liberty, and the Pursuit of Happiness.

Dictionary-based iteration

Weholdthesetruthstobeself-evident:thatallmenare createdequal,thattheyareendowedbytheirCreatorwith certainunalienablerights,thatamongtheseareLife, Liberty,andthePursuitofHappiness.

Dictionary-based iteration

$dictionary=[A-Za-z\-\’];

Dictionary-based iteration

themendinetonight
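The run of letters above can be divided by trying dictionary words longest-first and backing up whenever a choice leads to a dead end. The Python sketch below is one simple way to do that. The word list is invented for the example, and the search order is an assumption rather than a description of the actual implementation.

```python
# Hypothetical dictionary, chosen so that greedy matching hits dead ends.
DICTIONARY = {
    "the", "them", "theme", "me", "men", "mend", "end",
    "din", "dine", "in", "to", "ton", "tonight", "night", "on",
}

def segment(text, dictionary=DICTIONARY):
    """Split text into dictionary words, or return None if that is impossible."""
    max_len = max(map(len, dictionary))
    dead_ends = set()                 # positions already known to fail

    def solve(pos):
        if pos == len(text):
            return []
        if pos in dead_ends:
            return None
        # Try the longest candidate first, then progressively shorter ones.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in dictionary:
                rest = solve(end)
                if rest is not None:
                    return [text[pos:end]] + rest
        dead_ends.add(pos)            # remember the failure to avoid rework
        return None

    return solve(0)

print(segment("themendinetonight"))
```

Here "theme" and "them" are tried first and abandoned, because nothing in the word list can continue from "ndine..." or complete "endinetonight".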
