Building a Large Annotated Corpus of English: The Penn Treebank
Abstract
In this paper, we review our experience with constructing one large annotated corpus, the Penn Treebank, a corpus consisting of over 4.5 million words of American English. ...
Building A Large Annotated Corpus of English: The Penn Treebank
MS-CIS-93-87
LINC LAB 26

Mitchell P. Marcus
Beatrice Santorini
Mary Ann Marcinkiewicz

University of Pennsylvania
School of Engineering and Applied Science
Computer and Information Science Department
Philadelphia, PA 19104-6389

October 1993
Building a large annotated corpus of English: the Penn Treebank

Mitchell P. Marcus*, Beatrice Santorini†, Mary Ann Marcinkiewicz*

*Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104
†Department of Linguistics, Northwestern University, Evanston, IL 60208
1 Introduction

There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken ...
... then corrected by human annotators. Section 3 briefly presents the results of a comparison between entirely manual and semi-automated tagging, with the latter being shown to be superior on three counts: speed, consistency, and accuracy. In section 4, we turn to the bracketing task. Just as with the tagging task, we have partially automated the bracketing task: the output of the POS tagging phase is automatically parsed and simplified to yield a skeletal syntactic representation, which is then corrected by human annotators. After presenting the set of syntactic tags that we use, we illustrate and discuss the bracketing process. In particular, we will outline various factors that affect the speed with which annotators are able to correct bracketed structures, a task which, not surprisingly, is considerably more difficult than correcting POS-tagged text. Finally, section 5 describes the composition and size of the current Treebank corpus, briefly reviews some of the research projects that have relied on it to date, and indicates the directions that the project is likely to take in the future.
2 Part of speech tagging

2.1 A simplified POS tagset for English

The POS tagsets used to annotate large corpora in the past have traditionally been fairly extensive. The pioneering Brown Corpus distinguishes 87 simple tags ([Francis 1964]; [Francis and Kučera 1982]) and allows the formation of compound tags; thus, the contraction I'm is tagged as PPSS+BEM (PPSS for non-3rd person nominative personal pronoun and BEM for am, 'm). Subsequent projects have tended to elaborate the Brown Corpus tagset. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English 197 tags. The rationale behind developing such large, richly articulated tagsets is to approach the ideal of providing distinct codings for all classes of words having distinct grammatical behaviour ([Garside et al. 1987], 167).
2.1.1 Recoverability

Like the tagsets just mentioned, the Penn Treebank tagset is based on that of the Brown Corpus. However, the stochastic orientation of the Penn Treebank and the resulting concern with sparse data led us to modify the Brown Corpus tagset by paring it down considerably. A key strategy in reducing the tagset was to eliminate redundancy by taking into account both lexical and syntactic information. Thus, whereas many POS tags in the Brown Corpus tagset are unique to a particular lexical item, the Penn Treebank tagset strives to eliminate such instances of lexical redundancy. For instance, the Brown Corpus distinguishes five different forms for main verbs: the base form is tagged VB, and forms with overt endings are indicated by appending D for past tense, G for present participle/gerund, N for past participle and Z for third person singular present. Exactly the same paradigm is recognized for have, but have (regardless of whether it is used as an auxiliary or a main verb) is assigned its own base tag HV.
3 Counting both simple and compound tags, the Brown Corpus tagset contains 187 tags.
4 A useful overview of the relation of these and other tagsets to each other and to the Brown Corpus tagset is given in Appendix B of [Garside et al. 1987].
The Brown Corpus further distinguishes three forms of do (the base form DO, the past tense DOD, and the third person singular present DOZ), and eight forms of be: the five forms distinguished for regular verbs as well as the irregular forms am (BEM), are (BER) and was (BEDZ). By contrast, since the distinctions between the forms of VB on the one hand and the forms of BE, DO and HV on the other are lexically recoverable, they are eliminated in the Penn Treebank, as shown in Table 1.
Table 1: Elimination of lexically recoverable distinctions
A second example of lexical recoverability concerns those words that can precede articles in noun phrases. The Brown Corpus assigns a separate tag to pre-qualifiers (quite, rather, such), to pre-quantifiers (all, half, many, nary) and to both. The Penn Treebank, on the other hand, assigns all of these words to a single category PDT (predeterminer). Further examples of lexically recoverable categories are the Brown Corpus categories PPL (singular reflexive pronoun) and PPLS (plural reflexive pronoun), which we collapse with PRP (personal pronoun), and the Brown Corpus category RN (nominal adverb), which we collapse with RB (adverb).

Beyond reducing lexically recoverable distinctions, we also eliminated certain POS distinctions that are recoverable with reference to syntactic structure. For instance, the Penn Treebank tagset does not distinguish subject pronouns from object pronouns even in cases where the distinction is not recoverable from the pronoun's form, as with you, since the distinction is recoverable on the basis of the pronoun's position in the parse tree in the parsed version of the corpus. Similarly, the Penn Treebank tagset conflates subordinating conjunctions with prepositions, tagging both categories as IN. The distinction between the two categories is not lost, however, since subordinating conjunctions can be recovered as those instances of IN that precede clauses, whereas prepositions are those instances of IN that precede noun phrases or prepositional phrases. We would like to emphasize that the lexical and syntactic recoverability inherent in the POS-tagged version of the Penn Treebank corpus allows end users to employ a much richer tagset than the small one described in section 2.2 if the need arises.
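To make the recoverability point concrete, the following minimal sketch (illustrative only, not part of the Treebank project's tools) shows how an end user might derive a richer tagset from the bracketed corpus by splitting IN into subordinating-conjunction and preposition uses according to the category of the constituent that follows; the nested-list tree encoding and the refined tag names IN-SCONJ and IN-PREP are assumptions made for illustration.

# Minimal sketch: recover the subordinating conjunction vs. preposition
# distinction for IN tokens from the parsed corpus by looking at the category
# of the constituent that follows the IN token.
# Trees are nested lists [label, child, ...]; leaves are (word, pos) tuples.

CLAUSE_LABELS = {"S", "SBAR", "SBARQ", "SINV", "SQ"}

def refine_in_tags(tree):
    """Yield (word, refined_tag) for every IN token in the tree."""
    children = tree[1:]
    for i, child in enumerate(children):
        if isinstance(child, tuple):                      # a (word, pos) leaf
            word, pos = child
            if pos == "IN":
                # Label of the first constituent to the right of the IN token.
                sibling = next((c[0] for c in children[i + 1:]
                                if isinstance(c, list)), None)
                yield word, ("IN-SCONJ" if sibling in CLAUSE_LABELS
                             else "IN-PREP")
        else:                                             # a subtree: recurse
            yield from refine_in_tags(child)

examples = [
    ["SBAR", ("that", "IN"),                              # IN before a clause
     ["S", ["NP", ("prices", "NNS")], ["VP", ("fell", "VBD")]]],
    ["PP", ("with", "IN"), ["NP", ("confidence", "NN")]], # IN before an NP
]
for tree in examples:
    print(list(refine_in_tags(tree)))
# [('that', 'IN-SCONJ')]
# [('with', 'IN-PREP')]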
2.1.2 Consistency

As noted above, one reason for eliminating a POS tag such as RN (nominal adverb) is its lexical recoverability. Another important reason for doing so is consistency.
5 The gerund and the participle of do are tagged VBG and VBN in the Brown Corpus, respectively, presumably because they are never used as auxiliary verbs.
6 The irregular present tense forms am and are are tagged as VBP in the Penn Treebank (see section 2.1.3), just like any other non-third person singular present tense form.
For instance, in the Brown Corpus, the deictic adverbs there and now are always tagged RB (adverb), whereas their counterparts here and then are inconsistently tagged as RB (adverb) or RN (nominal adverb), even in identical syntactic contexts, such as after a preposition. It is clear that reducing the size of the tagset reduces the chances of such tagging inconsistencies.
2.1.3 Syntactic function

A further difference between the Penn Treebank and the Brown Corpus concerns the significance accorded to syntactic context. In the Brown Corpus, words tend to be tagged independently of their syntactic function. For instance, in the phrase the one, one is always tagged as CD (cardinal number), whereas in the corresponding plural phrase the ones, ones is always tagged as NNS (plural common noun), despite the parallel function of one and ones as heads of their noun phrase. By contrast, since one of the main roles of the tagged version of the Penn Treebank corpus is to serve as the basis for a bracketed version of the corpus, we encode a word's syntactic function in its POS tag whenever possible. Thus, one is tagged as NN (singular common noun) rather than as CD (cardinal number) when it is the head of a noun phrase. Similarly, while the Brown Corpus tags both as ABX (pre-quantifier, double conjunction) regardless of whether it functions as a prenominal modifier (both the boys), a postnominal modifier (the boys both), the head of a noun phrase (both of the boys) or part of a complex coordinating conjunction (both boys and girls), the Penn Treebank tags both differently in each of these syntactic contexts: as PDT (predeterminer), RB (adverb), NNS (plural common noun) and coordinating conjunction (CC), respectively.
There is one case in which our concern with tagging by syntactic function has led us to bifurcate Brown Corpus categories rather than to collapse them: namely, in the case of the uninflected form of verbs. Whereas the Brown Corpus tags the bare form of a verb as VB regardless of whether it occurs in a tensed clause, the Penn Treebank tagset distinguishes VB (infinitive or imperative) from VBP (non-third person singular present tense).
2.1.4 Indeterminacy

A final difference between the Penn Treebank tagset and all other tagsets we are aware of concerns the issue of indeterminacy: both POS ambiguity in the text and annotator uncertainty. In many cases, POS ambiguity can be resolved with reference to the linguistic context. So, for instance, in Katharine Hepburn's witty line Grant can be outspoken--but not by anyone I know, the presence of the by-phrase forces us to consider outspoken as the past participle of a transitive derivative of speak (outspeak) rather than as the adjective outspoken. However, even given explicit criteria for assigning POS tags to potentially ambiguous words, it is not always possible to assign a unique tag to a word with confidence. Since a major concern of the Treebank is to avoid requiring annotators to make arbitrary decisions, we allow words to be associated with more than one POS tag. Such multiple tagging indicates either that the word's part of speech simply cannot be decided or that the annotator is unsure which of the alternative tags is the correct one.
7 An important exception is there, which the Brown Corpus tags as EX (existential there) when it is used as a formal subject and as RB (adverb) when it is used as a locative adverb. In the case of there, we did not pursue our strategy of tagset reduction to its logical conclusion, which would have implied tagging existential there as NN (common noun).
In principle, annotators can tag a word with any number of tags, but in practice, multiple tags are restricted to a small number of recurring two-tag combinations: JJ|NN (adjective or noun as prenominal modifier), JJ|VBG (adjective or gerund/present participle), JJ|VBN (adjective or past participle), NN|VBG (noun or gerund), and RB|RP (adverb or particle).
2.2 The POS tagset

The Penn Treebank tagset is given in Table 2. It contains 36 POS tags and 12 other tags (for punctuation and currency symbols). A detailed description of the guidelines governing the use of the tagset is available in [Santorini 1990].
Table 2: The Penn Treebank POS tagset

1. CC    Coordinating conjunction
2. CD    Cardinal number
3. DT    Determiner
4. EX    Existential there
5. FW    Foreign word
6. IN    Preposition/subordinating conjunction
7. JJ    Adjective
8. JJR   Adjective, comparative
9. JJS   Adjective, superlative
10. LS   List item marker
11. MD   Modal
12. NN   Noun, singular or mass
13. NNS  Noun, plural
14. NNP  Proper noun, singular
15. NNPS Proper noun, plural
16. PDT  Predeterminer
17. POS  Possessive ending
18. PRP  Personal pronoun
19. PP$  Possessive pronoun
20. RB   Adverb
21. RBR  Adverb, comparative
22. RBS  Adverb, superlative
23. RP   Particle
24. SYM  Symbol (mathematical or scientific)
25. TO   to
26. UH   Interjection
27. VB   Verb, base form
28. VBD  Verb, past tense
29. VBG  Verb, gerund/present participle
30. VBN  Verb, past participle
31. VBP  Verb, non-3rd ps. sing. present
32. VBZ  Verb, 3rd ps. sing. present
33. WDT  wh-determiner
34. WP   wh-pronoun
35. WP$  Possessive wh-pronoun
36. WRB  wh-adverb
37. #    Pound sign
38. $    Dollar sign
39. .    Sentence-final punctuation
40. ,    Comma
41. :    Colon, semi-colon
42. (    Left bracket character
43. )    Right bracket character
44. "    Straight double quote
45. `    Left open single quote
46. ``   Left open double quote
47. '    Right close single quote
48. ''   Right close double quote
8 In versions of the tagged corpus distributed before November 1992, singular proper nouns, plural proper nouns and personal pronouns were tagged as NP, NPS and PP, respectively. The current tags NNP, NNPS and PRP were introduced in order to avoid confusion with the syntactic tags NP (noun phrase) and PP (prepositional phrase) (see Table 3).
2.3 The POS tagging process

The tagged version of the Penn Treebank corpus is produced in two stages, using a combination of automatic POS assignment and manual correction.

2.3.1 Automated stage

During the early stages of the Penn Treebank project, the initial automatic POS assignment was provided by PARTS ([Church 1988]), a stochastic algorithm developed at AT&T Bell Labs. PARTS uses a modified version of the Brown Corpus tagset close to our own and assigns POS tags with an error rate of 3-5%. The output of PARTS was automatically tokenized and the tags assigned by PARTS were automatically mapped onto the Penn Treebank tagset. This mapping introduces about 4% error, since the Penn Treebank tagset makes certain distinctions that the PARTS tagset does not. A sample of the resulting tagged text, which has an error rate of 7-9%, is shown in Figure 1.
Figure 1: Sample tagged text (before correction)
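The splitting of contractions and genitives mentioned in footnote 9 can be pictured with the following sketch; the rule table and function below are illustrative stand-ins, not the project's actual tokenizer.

# Illustrative stand-in for the splitting described in footnote 9: each
# contraction or Anglo-Saxon genitive is split into its component morphemes
# so that each can be tagged separately, e.g. children's -> children 's and
# won't -> wo n't.

IRREGULAR_SPLITS = {"won't": ["wo", "n't"], "can't": ["ca", "n't"],
                    "I'm": ["I", "'m"]}

def split_tokens(text):
    tokens = []
    for word in text.split():
        if word in IRREGULAR_SPLITS:
            tokens.extend(IRREGULAR_SPLITS[word])
        elif word.endswith("'s"):
            tokens.extend([word[:-2], "'s"])   # genitive or 's contraction
        elif word.endswith("n't"):
            tokens.extend([word[:-3], "n't"])  # regular negative contraction
        else:
            tokens.append(word)
    return tokens

print(split_tokens("the children's book won't sell"))
# ['the', 'children', "'s", 'book', 'wo', "n't", 'sell']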
More recently, the automatic POS assignment is provided by a cascade of stochastic and rule-driven taggers developed on the basis of our early experience. Since these taggers are based on the Penn Treebank tagset, the 4% error rate introduced as an artefact of mapping from the PARTS tagset to ours is eliminated, and we obtain error rates of 2-6%.
2.3.2 Manual correction stage

The result of the first, automated stage of POS tagging is given to annotators to correct. The annotators use a mouse-based package written in GNU Emacs Lisp, which is embedded within the GNU Emacs editor ([Lewis et al. 1990]).
9 In contrast to the Brown Corpus, we do not allow compound tags of the sort illustrated above for I'm. Rather, contractions and the Anglo-Saxon genitive of nouns are automatically split into their component morphemes, and each morpheme is tagged separately. Thus, children's is tagged children/NNS 's/POS and won't is tagged wo/MD n't/RB.
10 The two largest sources of mapping error are that the PARTS tagset distinguishes neither infinitives from the non-third person singular present tense forms of verbs, nor prepositions from particles in cases like run up a hill and run up a bill.
The package allows annotators to correct POS assignment errors by positioning the cursor on an incorrectly tagged word and then entering the desired correct tag (or sequence of multiple tags). The annotators' input is automatically checked against the list of legal tags in Table 2 and, if valid, appended to the original word-tag pair, separated by an asterisk. Appending the new tag rather than replacing the old tag allows us to easily identify recurring errors at the automatic POS assignment stage. We believe that the confusion matrices that can be extracted from this information should also prove useful in designing better automatic taggers in the future. The result of this second stage of POS tagging is shown in Figure 2. Finally, in the distribution version of the tagged corpus, any incorrect tags assigned at the first, automatic stage are removed.
Figure 2: Sample tagged text (after correction)
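The following sketch illustrates how such a confusion matrix might be tallied from corrected text; the token format word/ORIGINAL*CORRECTED used below is an assumption made for illustration and only a rough rendering of the convention described above.

# Sketch: extract a tagger confusion matrix from corrected text, assuming
# each corrected token carries the hand-corrected tag appended to the
# original word/tag pair with an asterisk (e.g. running/VBG*NN); tokens
# without an asterisk were already correct.
from collections import Counter

def confusion_counts(tokens):
    """Map (automatic_tag, corrected_tag) -> frequency."""
    counts = Counter()
    for token in tokens:
        _word, _, tags = token.rpartition("/")
        if "*" in tags:
            auto_tag, corrected_tag = tags.split("*", 1)
        else:
            auto_tag = corrected_tag = tags
        counts[(auto_tag, corrected_tag)] += 1
    return counts

sample = "The/DT gerund/NN running/VBG*NN is/VBZ common/JJ ./.".split()
for (auto, gold), n in confusion_counts(sample).most_common():
    print(auto, "->", gold, n)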
The learning curve for the POS tagging task takes under a month (at 15 hours a week), and annotation speeds after a month exceed 3,000 words per hour.
3 Two modes of annotation: an experiment

To determine how to maximize the speed, inter-annotator consistency and accuracy of POS tagging, we performed an experiment at the very beginning of the project to compare two alternative modes of annotation. In the first annotation mode ("tagging"), annotators tagged unannotated text entirely by hand; in the second mode ("correcting"), they verified and corrected the output of PARTS, modified as described above. This experiment showed that manual tagging took about twice as long as correcting, with about twice the inter-annotator disagreement rate and an error rate that was about 50% higher.
Four annotators, all with graduate training in linguistics, participated in the experiment. All completed a training sequence consisting of fifteen hours of correcting, followed by six hours of tagging. The training material was selected from a variety of nonfiction genres in the Brown Corpus. All the annotators were familiar with GNU Emacs at the outset of the experiment. Eight 2,000-word samples were selected from the Brown Corpus, two each from four different genres (two fiction, two nonfiction), none of which the annotators had encountered in training. The texts for the correction task were automatically tagged as described in section 2.3.
Each annotator first manually tagged four texts and then corrected four automatically tagged texts. Each annotator completed the four genres in a different permutation.
A repeated measures analysis of annotation speed with annotator identity, genre and annotation mode (tagging vs. correcting) as classification variables showed a significant annotation mode effect (p < .05). No other effects or interactions were significant. The average speed for correcting was more than twice as fast as the average speed for tagging: 20 minutes vs. 44 minutes per 1,000 words. (Median speeds per 1,000 words were 22 vs. 42 minutes.)
A simple measure of tagging consistency is inter-annotator disagreement rate, the rate at which annotators disagree with one another over the tagging of lexical tokens, expressed as a percentage of the raw number of such disagreements over the number of words in a given text sample. For a given text and n annotators, there are n(n-1)/2 disagreement ratios (one for each possible pair of annotators). Mean inter-annotator disagreement was 7.2% for the tagging task and 4.1% for the correcting task (with medians 7.2% and 3.6%, respectively). Upon examination, a disproportionate amount of disagreement in the correcting case was found to be caused by one text that contained many instances of a cover symbol for chemical and other formulas. In the absence of an explicit guideline for tagging this case, the annotators had ...
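A minimal sketch of the pairwise disagreement measure just described, assuming each annotator's output is a list of POS tags aligned over the same tokens of a sample (the tags below are invented for illustration):

from itertools import combinations

def disagreement_rates(annotations):
    """Return one disagreement percentage per pair of annotators."""
    rates = {}
    for (name_a, tags_a), (name_b, tags_b) in combinations(annotations.items(), 2):
        differing = sum(a != b for a, b in zip(tags_a, tags_b))
        rates[(name_a, name_b)] = 100.0 * differing / len(tags_a)
    return rates

annotations = {
    "annotator1": ["DT", "NN", "VBZ", "RB", "JJ", "."],
    "annotator2": ["DT", "NN", "VBZ", "RB", "NN", "."],
    "annotator3": ["DT", "JJ", "VBZ", "RB", "JJ", "."],
}
for pair, rate in disagreement_rates(annotations).items():
    print(pair, round(rate, 1))   # n(n-1)/2 = 3 ratios for n = 3 annotators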
4 Bracketing

4.1 Basic methodology

The methodology for bracketing the corpus is completely parallel to that for tagging: hand correction of the output of an errorful automatic process. Fidditch, a deterministic parser developed by Donald Hindle first at the University of Pennsylvania and subsequently at AT&T Bell Labs ([Hindle 1983], [Hindle 1989]), is used to provide an initial parse of the material. Annotators then hand correct the parser's output using a mouse-based interface implemented in GNU Emacs Lisp. Fidditch has three properties that make it ideally suited to serve as a preprocessor to hand correction:

- Fidditch always provides exactly one analysis for any given sentence, so that annotators need not search through multiple analyses.
- Fidditch never attaches any constituent whose role in the larger structure it cannot determine with certainty. In cases of uncertainty, Fidditch chunks the input into a string of trees, providing only a partial structure for each sentence.
- Fidditch has rather good grammatical coverage, so that the grammatical chunks that it does build are usually quite accurate.

Because of these properties, annotators do not need to rebracket much of the parser's output, a relatively time-consuming task. Rather, the annotators' main task is to glue together the syntactic chunks produced by the parser. Using a mouse-based interface, annotators move each unattached chunk of structure under the node to which it should be attached. Notational devices allow annotators to indicate uncertainty concerning constituent labels, and to indicate multiple attachment sites for ambiguous modifiers. The bracketing process is described in more detail in section 4.3.
4.2 The syntactic tagset

Table 3 shows the set of syntactic tags and null elements that we use in our skeletal bracketing. More detailed information on the syntactic tagset and guidelines concerning its use are to be found in [Santorini and Marcinkiewicz 1991].
Table 3: The Penn Treebank syntactic tagset

Tags
ADJP    Adjective phrase
ADVP    Adverb phrase
NP      Noun phrase
PP      Prepositional phrase
S       Simple declarative clause
SBAR    Clause introduced by subordinating conjunction or 0 (see below)
SBARQ   Direct question introduced by wh-word or wh-phrase
SINV    Declarative sentence with subject-aux inversion
SQ      Subconstituent of SBARQ excluding wh-word or wh-phrase
VP      Verb phrase
WHADVP  Wh-adverb phrase
WHNP    Wh-noun phrase
WHPP    Wh-prepositional phrase
X       Constituent of unknown or uncertain category

Null elements
1. *    Understood subject of infinitive or imperative
2. 0    Zero variant of that in subordinate clauses
3. T    Trace; marks position where moved wh-constituent is interpreted
4. NIL  Marks position where preposition is interpreted in pied-piping contexts
Although different in detail, our tagset is similar in delicacy to that used by the Lancaster Treebank Project, except that we allow null elements in the syntactic annotation. Because of the need to achieve a fairly high output per hour, it was decided not to require annotators to create distinctions beyond those provided by the parser. Our approach to developing the syntactic tagset was highly pragmatic and strongly influenced by the need to create a large body of annotated material given limited human resources. Despite the skeletal nature of the bracketing, however, it is possible to make quite delicate distinctions when using the corpus by searching for combinations of structures. For example, an SBAR containing the word to immediately before the VP will necessarily be infinitival, while an SBAR containing a verb or auxiliary with a tense feature will necessarily be tensed. To take another example, so-called that-clauses can be identified easily by searching for SBARs containing the word that or the null element 0 in initial position.
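The kinds of searches just described can be sketched as follows, assuming the skeletal brackets have been read into nested lists with bare strings as leaves (an assumed layout chosen for illustration, not the project's distribution format):

def subtrees(tree, label):
    """Yield every subtree whose category is `label`."""
    if isinstance(tree, list):
        if tree[0] == label:
            yield tree
        for child in tree[1:]:
            yield from subtrees(child, label)

def is_that_clause(sbar):
    """SBAR whose initial element is the word 'that' or the null element 0."""
    first_leaf = next((c for c in sbar[1:] if isinstance(c, str)), None)
    return first_leaf in ("that", "0")

def is_infinitival(tree):
    """True if anywhere below `tree` the word 'to' immediately precedes a VP."""
    if not isinstance(tree, list):
        return False
    children = tree[1:]
    for child, nxt in zip(children, children[1:]):
        if child == "to" and isinstance(nxt, list) and nxt[0] == "VP":
            return True
    return any(is_infinitival(c) for c in children)

# Toy tree containing one infinitival SBAR and one that-clause.
sent = ["S", ["NP", "she"],
        ["VP", "expects",
         ["SBAR", ["S", ["NP", "*"], "to", ["VP", "win"]]],
         ["SBAR", "that", ["S", ["NP", "he"], ["VP", "agrees"]]]]]
for sbar in subtrees(sent, "SBAR"):
    print(is_infinitival(sbar), is_that_clause(sbar))
# True False
# False True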
As can be seen from Table 3, the syntactic tagset used by the Penn Treebank includes a variety of null elements, a subset of the null elements introduced by Fidditch. While it would be expensive to insert null elements entirely by hand, it has not proved overly onerous to maintain and correct those that are automatically provided. We have chosen to retain these null elements because we believe that they can be exploited in many cases to establish a sentence's predicate-argument structure;
at least one recipient of the parsed corpus has used it to bootstrap the development of lexicons for particular NLP projects and has found the presence of null elements to be a considerable aid in determining verb transitivity (Robert Ingria, personal communication). While these null elements correspond more directly to entities in some grammatical theories than in others, it is not our intention to lean toward one or another theoretical view in producing our corpus. Rather, since the representational framework for grammatical structure in the Treebank is a relatively impoverished flat context-free notation, the easiest mechanism to include information about predicate-argument structure, although indirectly, is by allowing the parse tree to contain explicit null items.
4.3 Sample bracketing output

Below, we illustrate the bracketing process for the first sentence of our sample text. Figure 3 shows the output of Fidditch (modified slightly to include our POS tags).
Figure 3: Sample bracketed text (full structure provided by Fidditch)

(S (NP (NBAR (ADJP (ADJ Battle-tested/JJ)
                   (ADJ industrial/JJ))
             (NPL managers/NNS)))
   (? (ADV here/RB))
   (? (ADV always/RB))
   (AUX (TNS *))
   (VP (VPRES buck/VBP))
   (? (PP (PREP up/RP)
          (NP (NBAR (ADJ nervous/JJ)
                    (NPL newcomers/NNS)))))
   (? (PP (PREP with/IN)
          (NP (DART the/DT)
              (NBAR (N tale/NN))
              (PP of/PREP
                  (NP (DART the/DT)
                      (NBAR (ADJP (ADJ first/JJ))))))))
   (? (PP of/PREP
          (NP (PROS their/PP$)
              (NBAR (NPL countrymen/NNS)))))
   (? (S (NP (PRO *))
         (AUX to/TNS)
         (VP (V visit/VB)
             (NP (PNP Mexico/NNP)))))
   (? (MID ,/,))
   (? (NP (IART a/DT)
          (NBAR (N boatload/NN)
                (PP of/PREP
                    (NP (NBAR (NPL warriors/NNS))))
                (VP (VPPRT blown/VBN)
                    (? (ADV ashore/RB))
                    (NP (NBAR (CARD 375/CD)
                              (NPL years/NNS)))))))
   (? (ADV ago/RB))
   (? (FIN ./.)))
As Figure 3 shows, Fidditch leaves very many constituents unattached, labeling them as ?, and its output is perhaps better thought of as a string of tree fragments than as a single tree structure.
Fidditch only builds structure when this is possible for a purely syntactic parser without access to semantic or pragmatic information, and it always errs on the side of caution. Since determining the correct attachment point of prepositional phrases, relative clauses, and adverbial modifiers almost always requires extrasyntactic information, Fidditch pursues the very conservative strategy of always leaving such constituents unattached, even if only one attachment point is syntactically possible. However, Fidditch does indicate its best guess concerning a fragment's attachment site by the fragment's depth of embedding. Moreover, it attaches prepositional phrases beginning with of if the preposition immediately follows a noun; thus, tale of and boatload of are parsed as single constituents, while first of is not. Since Fidditch lacks a large verb lexicon, it cannot decide whether some constituents serve as adjuncts or arguments and hence leaves subordinate clauses such as infinitives as separate fragments. Note further that Fidditch creates adjective phrases only when it determines that more than one lexical item belongs in the ADJP. Finally, as is well known, the scope of conjunctions and other coordinate structures can only be determined given the richest forms of contextual information; here again, Fidditch simply turns out a string of tree fragments around any conjunction. Because all decisions within Fidditch are made locally, all commas (which often signal conjunction) must disrupt the input into separate chunks.
The original design of the Treebank called for a level of syntactic analysis comparable to the skeletal analysis used by the Lancaster Treebank, but a limited experiment was performed early in the project to investigate the feasibility of providing greater levels of structural detail. While the results were somewhat unclear, there was evidence that annotators could maintain a much faster rate of hand correction if the parser output was simplified in various ways, reducing the visual complexity of the tree representations and eliminating a range of minor decisions. The key results of this experiment were:

- Annotators take substantially longer to learn the bracketing task than the POS tagging task, with substantial increases in speed occurring even after two months of training.
- Annotators can correct the full structure provided by Fidditch at an average speed of approx. 375 words per hour after three weeks, and 475 words per hour after six weeks.
- Reducing the output from the full structure shown in Figure 3 to a more skeletal representation similar to that used by the Lancaster UCREL Treebank Project increases annotator productivity by approx. 100-200 words per hour.
- It proved to be very difficult for annotators to distinguish between a verb's arguments and adjuncts in all cases. Allowing annotators to ignore this distinction when it is unclear (attaching constituents high) increases productivity by approx. 150-200 words per hour. Informal examination of later annotation showed that forced distinctions cannot be made consistently.
As a result of this experiment, the originally proposed skeletal representation was adopted, without a forced distinction between arguments and adjuncts. Even after extended training, performance varies markedly by annotator, with speeds on the task of correcting skeletal structure without requiring a distinction between arguments and adjuncts ranging from approx. 750 words per hour to well over 1,000 words per hour after three or four months' experience. The fastest annotators work in bursts of well over 1,500 words per hour, alternating with brief rests.
At an average rate of 750 words per hour, a team of five part-time annotators annotating three hours a day should maintain an output of about 2.5 million words a year of treebanked sentences, with each sentence corrected once.
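(As a rough check, assuming roughly 220 working days a year: 750 words/hour x 3 hours/day x 5 annotators is 11,250 words a day, or about 2.5 million words a year.)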
It is worth noting that experienced annotators can proofread previously corrected material at very high speeds. A parsed subcorpus of over one million words was recently proofread at an average speed of approx. 4,000 words per annotator per hour. At this rate of productivity, annotators are able to find and correct gross errors in parsing, but do not have time to check, for example, whether they agree with all prepositional phrase attachments.
The process that creates the skeletal representations to be corrected by the annotators simplifies and flattens the structures shown in Figure 3 by removing POS tags, non-branching lexical nodes and certain phrasal nodes, notably NBAR. The output of the first automated stage of the bracketing task is shown in Figure 4.
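The flattening can be pictured with the following minimal sketch (illustrative only, not the project's actual code), which strips POS tags from the words, splices out NBAR nodes, and collapses non-branching lexical nodes; the example fragment corresponds to "the tale of the first" from the sample sentence.

# Trees are nested lists [label, child, ...] with "word/TAG" strings as leaves.
PHRASAL = {"S", "NP", "VP", "PP", "ADJP", "ADVP", "SBAR", "?"}

def simplify(tree):
    if isinstance(tree, str):
        return tree.split("/")[0]                  # drop the POS tag
    label, children = tree[0], [simplify(c) for c in tree[1:]]
    flat = []
    for child in children:
        if isinstance(child, list) and child[0] == "NBAR":
            flat.extend(child[1:])                 # splice NBAR children upward
        elif isinstance(child, list) and child[0] not in PHRASAL:
            flat.extend(child[1:])                 # collapse lexical nodes
        else:
            flat.append(child)
    return [label] + flat

fragment = ["NP", ["DART", "the/DT"],
            ["NBAR", ["N", "tale/NN"]],
            ["PP", ["PREP", "of/IN"],
             ["NP", ["DART", "the/DT"],
              ["NBAR", ["ADJP", ["ADJ", "first/JJ"]]]]]]
print(simplify(fragment))
# ['NP', 'the', 'tale', ['PP', 'of', ['NP', 'the', ['ADJP', 'first']]]]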
Figure 4: Sample bracketed text (after simplification, before correction)

(S (NP (ADJP Battle-tested industrial)
       managers)
   (? here)
   (? always)
   (VP buck)
   (? (PP up
          (NP nervous newcomers)))
   (? (PP with
          (NP the tale
              (PP of
                  (NP the
                      (ADJP first))))))
   (? (PP of
          (NP their countrymen)))
   (? (S (NP *)
         to
         (VP visit
             (NP Mexico))))
   (? ,)
   (? (NP a boatload
          (PP of
              (NP warriors))
          (VP blown
              (? ashore)
              (NP 375 years))))
   (? ago)
   (? .))
Annotators correct this simplified structure using a mouse-based interface. Their primary job is to glue fragments together, but they must also correct incorrect parses and delete some structure. Single mouse clicks perform the following tasks, among others. The interface correctly reindents the structure whenever necessary.

- Attach constituents labeled ?. This is done by pressing down the appropriate mouse button on or immediately after the ?, moving the mouse onto or immediately after the label of the intended parent and releasing the mouse button. Attaching constituents automatically deletes their ? label.
- Promote a constituent up one level of structure, making it a sibling of its current parent.
- Delete a pair of constituent brackets.
- Create a pair of brackets around a constituent. This is done by typing a constituent tag and then sweeping out the intended constituent with the mouse. The tag is checked to assure that it is a legal label.
- Change the label of a constituent. The new tag is checked to assure that it is legal.
The bracketed text after correction is shown in Figure 5. The fragments are now connected together into one rooted tree structure. The result is a skeletal analysis in that much syntactic detail is left unannotated. Most prominently, all internal structure of the NP up through the head and including any single-word post-head modifiers is left unannotated.
Figure 5: Sample bracketed text (after correction)

(S (NP Battle-tested industrial managers
       here)
   always
   (VP buck
       up
       (NP nervous newcomers)
       (PP with
           (NP the tale
               (PP of
                   (NP (NP the
                           (ADJP first)
                           (PP of
                               (NP their countrymen)))
                       (S (NP *)
                          to
                          (VP visit
                              (NP Mexico)))
                       (NP (NP a boatload
                               (PP of
                                   (NP (NP warriors)
                                       (VP-1 blown
                                             ashore
                                             (ADVP (NP 375 years)
                                                   ago)))))
                           (VP-1 *pseudo-attach*))))))))
As noted above in connection with POS tagging, a major goal of the Treebank project is to allow annotators only to indicate structure of which they are certain. The Treebank provides two notational devices to ensure this goal: the X constituent label and so-called pseudo-attachment. The X constituent label is used if an annotator is sure that a sequence of words is a major constituent but is unsure of its syntactic category; in such cases, the annotator simply brackets the sequence and labels it X. The second notational device, pseudo-attachment, has two primary uses. On the one hand, it is used to annotate what Kay has called permanent predictable ambiguities, allowing an annotator to indicate that a structure is globally ambiguous even given the surrounding context (annotators always assign structure to a sentence on the basis of its context). An example of this use of pseudo-attachment is shown in Figure 5, where the participial phrase blown ashore 375 years ago modifies either warriors or boatload, but there is no way of settling the question; both attachments mean exactly the same thing. In the case at hand, the pseudo-attachment notation indicates that the annotator of the sentence thought that VP-1 is most likely a modifier of warriors, but that it is also possible that it is a modifier of boatload. A second use of pseudo-attachment is to allow annotators to represent the underlying position of extraposed elements: in addition to being attached in its superficial position in the tree, the extraposed constituent is pseudo-attached within the constituent to which it is semantically related. Note that except for the device of pseudo-attachment, the skeletal analysis of the Treebank is entirely restricted to simple context-free trees.
The reader may have noticed that the ADJP brackets in Figure 4 have vanished in Figure 5. For the sake of the overall efficiency of the annotation task, we leave all ADJP brackets in the simplified structure, with the annotators expected to remove many of them during annotation. The reason for this is somewhat complex, but provides a good example of the considerations that come into play in designing the details of annotation methods. The first relevant fact is that Fidditch only outputs ADJP brackets within NPs for adjective phrases containing more than one lexical item. To be consistent, the final structure must contain ADJP nodes for all adjective phrases within NPs or for none; we have chosen to delete all such nodes within NPs under normal circumstances. (This does not affect the use of the ADJP tag for predicative adjective phrases outside of NPs.) In a seemingly unrelated guideline, all coordinate structures are annotated in the Treebank; such coordinate structures are represented by Chomsky-adjunction when the two conjoined constituents bear the same label. This means that if an NP contains coordinated adjective phrases, then an ADJP tag will be used to tag that coordination, even though simple ADJPs within NPs will not bear an ADJP tag. Experience has shown that annotators can delete pairs of brackets extremely quickly using the mouse-based tools, whereas creating brackets is a much slower operation. Because the coordination of adjectives is quite common, it is more efficient to leave in ADJP labels, and delete them if they are not part of a coordinate structure, than to reintroduce them if necessary.
5 Progress to date

5.1 Composition and size of corpus

Table 4 shows the output of the Penn Treebank project at the end of its first phase.
12 This use of pseudo-attachment is identical to its original use in Church's parser ([Church 1980]).
Table 4: Penn Treebank (as of 11/92)

Description                        Tagged for Part of Speech (Tokens)   Skeletal Parsing (Tokens)
Dept. of Energy abstracts
Dow Jones Newswire stories
Dept. of Agriculture bulletins
Library of America texts
MUC-3 messages
IBM Manual sentences
WBUR radio transcripts
ATIS sentences
Brown Corpus, retagged
Total                              4,885,781                            2,6?1,1?8
All the materials listed above are available on CD-ROM to members of the Linguistic Data Consortium. About ... million words of POS-tagged material and a small sampling of skeletally parsed text are available as part of the first Association for Computational Linguistics/Data Collection Initiative CD-ROM, and a somewhat larger subset of materials is available on cartridge tape directly from the Penn Treebank Project. For information, contact the first author of this paper or send email to treebank@unagi.cis.upenn.edu.
- Dept. of Energy abstracts are scientific abstracts from a variety of disciplines.
- All of the skeletally parsed Dow Jones Newswire materials are also available as digitally recorded read speech as part of the DARPA WSJ-CSR corpus, available through the Linguistic Data Consortium.
- The Dept. of Agriculture materials include short bulletins such as when to plant various flowers and how to can various vegetables and fruits.
- The Library of America texts are 5,000-10,000 word passages, mainly book chapters, from a variety of American authors including Mark Twain, Henry Adams, Willa Cather, Herman Melville, W.E.B. Du Bois, and Ralph Waldo Emerson.
- The MUC-3 texts are all news stories from the Federal News Service about terrorist activities in South America. Some of these texts are translations of Spanish news stories or transcripts of radio broadcasts. They are taken from training materials for the 3rd Message Understanding Conference.
13 Contact the Linguistic Data Consortium, 441 Williams Hall, University of Pennsylvania, Philadelphia, PA 19104-6305, or send email to ldc@unagi.cis.upenn.edu for more information.
- The Brown Corpus materials were completely retagged in the Penn Treebank project, starting from the untagged version of the Brown Corpus ([Francis 1964]).
- The IBM sentences are taken from IBM computer manuals; they are chosen to contain a vocabulary of 3,000 words, and are limited in length.
- The ATIS sentences are transcribed versions of spontaneous sentences collected as training materials for the DARPA Air Travel Information System project.
The entire corpus has been tagged for POS information, at an estimated error rate of approx. 3%. The POS-tagged versions of the Library of America texts and the Department of Agriculture bulletins have been corrected twice (each by a different annotator), and the corrected files were then carefully adjudicated; we estimate the error rate of the adjudicated version at well under 1%. Using a version of PARTS retrained on the entire preliminary corpus and adjudicating between the output of the retrained version and the preliminary version of the corpus, we plan to reduce the error rate of the final version of the corpus to approx. 1%. All the skeletally parsed materials have been corrected once, except for the Brown materials, which have been quickly ...

... employed has a variety of limitations. Many otherwise clear argument/adjunct relations in the corpus are not indicated, due to the current Treebank's essentially context-free representation. For example, there is at present no satisfactory representation for sentences in which complement noun phrases or clauses occur after a sentential-level adverb. Either the adverb is trapped within the VP, so that the complement can occur within the VP where it belongs, or else the adverb is attached to the S, closing off the VP and forcing the complement to attach to the S. This trapping problem serves as a limitation for groups that currently use Treebank material to semiautomatically derive lexicons for particular applications. For most of these problems, however, solutions are possible on the basis of mechanisms already used by the Treebank Project.
For example, the pseudo-attachment notation can be extended to indicate a variety of crossing dependencies. We have recently begun to use this mechanism to represent various kinds of dislocations, and the Treebank annotators themselves have developed a detailed proposal to extend pseudo-attachment to a wide range of similar phenomena.

A variety of inconsistencies in the annotation scheme used within the Treebank have also become apparent with time. The annotation schemes for some syntactic categories should be unified to allow a consistent approach to determining predicate-argument structure. To take a very simple example, sentential adverbs attach under VP when they occur between auxiliaries and predicative ADJPs, but attach under S when they occur between auxiliaries and VPs. These structures need to be regularized.
As the current Treebank has been exploited by a variety of users, a significant number have expressed a need for forms of annotation richer than those provided by the project's first phase. Some users would like a less skeletal form of annotation of surface grammatical structure, expanding the essentially context-free analysis of the current Penn Treebank to indicate a wide variety of non-contiguous structures and dependencies. A wide range of Treebank users now strongly desire a level of annotation which makes explicit some form of predicate-argument structure. The desired level of representation would make explicit the logical subject and logical object of the verb, and would indicate, at least in clear cases, which subconstituents serve as arguments of the underlying predicates and which serve as modifiers.
During the next phase of the Treebank project, we expect both to provide a richer analysis of the existing corpus and to provide a parallel corpus of predicate-argument structures. This will be done by first enriching the annotation of the current corpus, and then automatically extracting predicate-argument structure, at the level of distinguishing logical subjects and objects, and distinguishing arguments from adjuncts in clear cases. Enrichment will be achieved by automatically transforming the current Penn Treebank into a level of structure close to the intended target, and then completing the conversion by hand.
References

[Brill 1991] Brill, Eric. 1991. Discovering the lexical features of a language. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA.

[Brill et al. 1990] Brill, Eric; Magerman, David; Marcus, Mitchell P.; and Santorini, Beatrice. 1990. Deducing linguistic structure from the statistics of large corpora. In Proceedings of the DARPA Speech and Natural Language Workshop, June 1990, pages 275-282.

[Church 1980] Church, Kenneth W. 1980. Memory limitations in natural language processing. MIT LCS Technical Report 245. Master's thesis, Massachusetts Institute of Technology.

[Church 1988] Church, Kenneth W. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, 26th Annual Meeting of the Association for Computational Linguistics, pages 136-143.

[Francis 1964] Francis, W. Nelson. 1964. A standard sample of present-day English for use with digital computers. Report to the U.S. Office of Education on Cooperative Research Project No. E-007. Brown University, Providence.

[Francis and Kučera 1982] Francis, W. Nelson and Kučera, Henry. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, Boston.

[Garside et al. 1987] Garside, Roger; Leech, Geoffrey; and Sampson, Geoffrey, editors. 1987. The Computational Analysis of English: A Corpus-Based Approach. Longman, London.

[Hindle 1983] Hindle, Donald. 1983. User manual for Fidditch, a deterministic parser. Naval Research Laboratory Technical Memorandum.

[Hindle 1989] Hindle, Donald. 1989. Acquiring disambiguation rules from text. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics.

[Lewis et al. 1990] Lewis, Bil; LaLiberte, Dan; et al. 1990. The GNU Emacs Lisp Reference Manual. Free Software Foundation.

[Santorini 1990] Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank Project. Technical report, Department of Computer and Information Science, University of Pennsylvania.

[Santorini and Marcinkiewicz 1991] Santorini, Beatrice and Marcinkiewicz, Mary Ann. 1991. Bracketing guidelines for the Penn Treebank Project. Ms., Department of Computer and Information Science, University of Pennsylvania.

[Veilleux and Ostendorf 1992] Veilleux, N. M. and Ostendorf, Mari. 1992. Probabilistic parse scoring based on prosodic features. In Proceedings of the Fifth DARPA Speech and Natural Language Workshop, February 1992.

[Weischedel et al. 1991] Weischedel, Ralph; Ayuso, Damaris; Bobrow, R.; Boisen, Sean; Ingria, Robert; and Palmucci, Jeff. 1991. Partial parsing: a report of work in progress. In Proceedings of the Fourth DARPA Speech and Natural Language Workshop, February 1991. Draft version.