Corpus Linguistics

Corpus linguistics: an introduction

ENG 447

Key points
Basic notions
Historical development: two competing approaches
Types of corpus
Exploiting a corpus
Resources

Basic notions

Corpus: A collection of naturally occurring language text, chosen to characterise a state or variety of language (Sinclair)

A collection of linguistic data, either written text or a transcription of recorded data, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language (Dictionary of linguistics and phonetics)

What is a corpus?

Large body of evidence typically composed of attested language use (McEnery)

Usually a corpus is in machine-readable format and is ideally viewable and analysable through (a single) software package

The word corpus comes from the Latin for "body"; the plural is corpora

“If it happens once, you don't know anything. If it happens twice, it suggests further investigation. If it happens three or more times, then you have something to write about!”

History

We have to split the history into two periods: before Chomsky and after Chomsky

Before Chomsky, methods similar to those of corpus linguistics were used (empiricism)
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus1/1fra1.htm

Early corpus linguistics
Before Chomsky
Computers were not available, so it was difficult to analyse large collections of text
Studies of child language using diaries kept by parents
Spelling conventions in a German corpus of 11 million words
Foreign language pedagogy

Early corpus linguistics (II)
All the work of early corpus linguistics was underpinned by two fundamental, yet flawed, assumptions:
The sentences of a natural language are finite.
The sentences of a natural language can be collected and enumerated.
Most linguists saw the corpus as the only source of linguistic evidence in the formation of linguistic theories

Chomsky

Between 1957 and 1965 Chomsky changed the direction of linguistics from empiricism towards rationalism

“Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list” (Chomsky, 1962)

Introspection started to be used instead

Problems with introspection

Naturally occurring data is observable and verifiable by everyone. Introspective data is artificial. Human beings have only the vaguest notion of the frequency of a construct or a word.

The revival of corpus linguistics

Research in corpus linguistics continued in small centres
Hardware still imposed some restrictions; real development started only in the 1980s
Fields like computational linguistics were not interested in using corpora

Fillmore’s description of the two approaches

The corpus linguist: “He has all the primary facts that he needs, in the form of a corpus of approximately one zillion running words, and he sees his job as that of deriving secondary facts from his primary facts. At the moment, he is busy determining the relative frequencies of the eleven parts of speech as the first word of a sentence versus the second word of a sentence.”

The “armchair” (introspective) linguist: “He sits in a deep soft armchair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes, sits up abruptly shouting, ‘Wow, what a neat fact!’, grabs his pencil, and writes something down… having come still no closer to knowing what language is really like.”

Goals of corpus linguistics

Chomskyan linguistics | Corpus linguistics
‘Langue’ (competence) | ‘Parole’ (performance)
Ideal speaker/hearer | Complexity/variation
Language = innate mental faculty | Language = social phenomenon
Intuitive evidence | Empirical evidence
Universals | Differences
Grammar | Meaning

Types of corpora
Written vs Spoken
General vs Specialised, e.g. ESP, Learner corpora
Monolingual vs Multilingual, e.g. Parallel, Comparable
Synchronic vs Diachronic; Monitor
Annotated vs Unannotated

Written corpora

Corpus | Brown | LOB
Time of compilation | 1960s | 1970s
Compiled at | Brown University (US) | Lancaster, Oslo, Bergen
Language variety | Written American English | Written British English
Size (both corpora): 1 million words (500 texts of 2000 words each)
Design (both corpora): Balanced corpora; 15 genres of text, incl. press reportage, editorials, reviews, religion, government documents, reports, biographies, scientific writing, fiction

Specialised corpora

Corpus | CSPAE | CHILDES
Time of compilation | 1990s | Since 1980s
Compiled at / by | Michael Barlow (Rice Univ) | Project started at Carnegie Mellon Univ; contributors worldwide
Language variety | Spoken professional American English | 20 languages, incl. E. Asian, Germanic, Romance, Slavic…; mainly conversational data
Size | 2 million words (tagged) | c. 20 million words (growing)
Design | Transcripts from professional settings (meetings, conferences…) by 400 speakers; academia (1 M wds), politics (1 M wds) | “Child language data exchange system”, offering transcripts of monolingual and bilingual children’s language (language acquisition data)

Corpus | Compiled at | Language | Size | Design

First generation major corpora
Brown Corpus (1960s) | Brown Univ, US | Written American English | 1 million (tagged) | 15 genres of text: press reportage, religion, fiction…

Second generation mega corpora
Bank of English (since 1991) | COBUILD, Birmingham Univ | Written / spoken English | 450 million as of 2002 (tagged) | Monitor corpus; mostly written: newspapers, books; spoken: conversations, broadcasts, interviews...
International Corpus of English [ICE-GB] (1990s) | UCL, London | Written / spoken British English | 1 million (grammatically parsed) | One of 15 projects worldwide preparing different national / regional varieties of English; 200 written, 300 spoken texts, various genres

Specialised corpora
Corpus of Spoken Professional American English [CSPAE] (1990s) | Rice Univ, US | Spoken American English | 2 million (tagged) | Transcripts from professional settings (meetings, press conferences) by approximately 400 speakers, centred on activities tied to academics and politics

Learner corpora
International Corpus of Learner English [ICLE] (since 1990s) | Louvain Centre for English Corpus Linguistics, Belgium | English writing by learners from 19 mother-tongue backgrounds, incl. Chinese | Over 2 million | Essay writing by advanced learners of English as a foreign language

Non-English monolingual corpora
HK Cantonese Adult Corpus [HKCAC] (2000) | Dept of Speech & Hearing Sciences, HKU | HK Cantonese | 170,000 characters | Spontaneous speech recorded from phone-in radio programs and forums, by 69 speakers

Multilingual / Parallel corpora
International Telecommunications Corpus [ITU / CRATER] (1995) | CRATER project (Corpus Resources & Terminology Extraction), Lancaster Univ | French, English and Spanish | 1 million tokens in each language (tagged) | Trilingual parallel corpus from telecommunications domain; aligned at sentence level

Other examples of available corpora

Ways to exploit a corpus
Word (token) / type frequency lists
N-grams
Concordances
Collocations / colligations
Specially designed programs (especially when the corpus is annotated)

Frequency lists
Lists which indicate the words that appear in a corpus and their frequencies
They provide a survey of the corpus
A frequency list becomes more meaningful when compared with other lists
They remove a word from its contexts
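A frequency list of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular corpus tool: the tokenisation rule (lowercased alphabetic strings) and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()

# Made-up sample text, used only to illustrate the idea.
sample = "The cat sat on the mat. The dog sat too."
tokens = tokenize(sample)
freqs = frequency_list(sample)

# Tokens are running words; types are distinct word forms.
print(len(tokens), "tokens,", len(set(tokens)), "types")
for word, count in freqs[:3]:
    print(word, count)
```

The list shows how often each word occurs but nothing about where it occurs, which is the loss of context noted above.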

N-grams
Groups of N words which appear in sequence in the text
They are presented using frequency lists
A good way to identify recurring or corpus-specific expressions
They provide limited context for the words
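As a rough sketch, n-grams can be extracted by sliding a window of N tokens over the text and counting the results. The whitespace tokenisation and the sample phrase are assumptions for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative sample; whitespace splitting stands in for real tokenisation.
tokens = "on the stroke of midnight but upon the stroke of noon".split()

bigrams = ngrams(tokens, 2)
trigram_freqs = Counter(ngrams(tokens, 3))

# Print the most frequent trigrams with their counts.
for gram, count in trigram_freqs.most_common(3):
    print(" ".join(gram), count)
```

Sorting the counts gives an n-gram frequency list, in which recurring sequences such as "the stroke of" rise to the top.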

Concordances
Show words in the contexts in which they appear
Usually obtained using special programs which allow the user to manipulate the lists of concordances
KWIC (Key Word In Context) is the most common format
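The KWIC format can be approximated with a short Python sketch. It is not how MonoConc or any other concordancer is implemented; the function names `kwic` and `format_kwic`, the context width, and the sample sentence are all hypothetical choices for this example.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

def format_kwic(hits):
    """Right-align the left contexts so the keywords line up in one column."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in hits]

# Made-up sentence, tokenised on whitespace for simplicity.
text = ("Promptly on the stroke of six the gong sounded "
        "and on the stroke of seven dinner began")
hits = kwic(text.split(), "stroke")
for line in format_kwic(hits):
    print(line)
```

Aligning the keyword in a central column is what lets the reader scan down the list for recurring patterns to its left and right.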

Example of concordance output (from MonoConc)

     famous boots. On the stroke of full time the Stoke
          the lead on the stroke of half-time with a goal
  Smith sin-binned on the stroke of half-time, added a
clinched their win on the stroke of lunch after resuming
chase by declaring on the stroke of lunch. <p> With a lead
  expectant crowd, on the stroke of midday. The bird
  hour began not upon the stroke of midnight but upon the
 of midnight but upon the stroke of noon. There was,
booked in advance. On the stroke of seven, a gong summons
          Promptly on the stroke of six 'clock, the chooks
    from Edinburgh on the stroke of the Millennium.

Langue - Parole
Parole: syntagmatic
Langue: paradigmatic

Collocations
Collocation = the occurrence of two or more words within a short space of each other in a text
The collocates are extracted using a window to the left and right of a specified word
Can be used to further analyse the context of a word

http://www.sketchengine.co.uk/sampler/
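The window-based extraction of collocates can be sketched as follows. This is a minimal illustration using raw co-occurrence counts; corpus tools typically rank collocates with statistical association measures instead, which this sketch leaves out. The function name `collocates`, the window size and the sample phrase are assumptions.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts

# Illustrative sample, split on whitespace.
tokens = "on the stroke of midnight but upon the stroke of noon".split()
counts = collocates(tokens, "stroke", window=2)
for word, count in counts.most_common():
    print(word, count)
```

Widening the window brings in more distant words, at the cost of including more items that are not really associated with the node word.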

What can we do with a corpus? Two broad approaches

Corpus-based approaches: hypotheses are checked against a corpus
Corpus-driven approaches: hypotheses are drawn from the corpus

Posit a hypothesis: -le is a separate morpheme for the concept of future.

Concordance: find all occurrences of “le” in the wordforms of the corpus.

Test hypothesis

Test new hypothesis

[Figure: Distribution of subject personal pronouns across registers. x-axis: Register (Conv, Fict, News, Acad); y-axis: Frequency per million words; series: they, we, it, he/she, you, I]

[Figure: Comparison of individual modal and semi-modal verbs in conversation and academic prose (based on LGSWE, Table 6.6, p. 489). y-axis: Frequency per million words; series: Conversation, Academic prose]
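Both charts report frequency per million words, a normalisation that makes counts from corpora of different sizes comparable. A minimal sketch of the arithmetic, with made-up counts and corpus sizes that are not taken from LGSWE or any real corpus:

```python
def per_million(raw_count, corpus_size):
    """Convert a raw count into a frequency per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size

# Hypothetical figures: the same word counted in two corpora of different sizes.
freq_a = per_million(150, 2_000_000)  # 150 hits in a 2-million-word corpus
freq_b = per_million(90, 1_000_000)   # 90 hits in a 1-million-word corpus
print(freq_a, freq_b)
```

Here the raw count is higher in the first corpus, but the normalised frequency is higher in the second, which is why raw counts alone can mislead when corpus sizes differ.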

Fields where corpora are used

Lexicography, to design dictionaries
Language studies (relations between languages, differences between genres, evolution of the language)
Computational linguistics (training and testing methods)
Language teaching (learner corpora)
Cultural studies, psycholinguistics

Web as a corpus
The Web can be a very useful source of texts
The Web is very helpful for languages other than English
Quite often there is no control over the language being investigated, so filtering (if possible) is necessary

Existing corpora
Brown Corpus / LOB corpus
Bank of English
Wall Street Journal, Penn Treebank, BNC, ANC, ICE, WBE, Reuters Corpus
Canadian Hansard: parallel corpus, English-French
York-Helsinki Parsed Corpus of Old English Poetry
Tiger corpus: German
CORIS/CODIS: contemporary written Italian
MULTEXT: 1984 and The Republic in many languages

References
Karin Aijmer and Bengt Altenberg (1991) English corpus linguistics, Longman
Douglas Biber, Susan Conrad and Randi Reppen (1998) Corpus linguistics, Cambridge University Press
Graeme D. Kennedy (1998) An introduction to corpus linguistics, Longman
Tony McEnery and Andrew Wilson (1996) Corpus linguistics, Edinburgh University Press

Online resources
Corpus Linguistics Online: http://www.corpus4u.com/
Corpus Linguistics and English Language Teaching: http://sfs.scnu.edu.cn/corpus4u/
ConCapp: http://panda.nhce.edu.cn/corpus4u/tools/concord/wconcord.rar
BNC: http://sara.natcorp.ox.ac.uk/lookup.html
CLEC: http://www.clal.org.cn/corpus/ChiSearchEngine.aspx
CHILDES: http://childes.psy.cmu.edu/
