Corpus Linguistics
Corpus linguistics: an introduction
ENG 447
Key points
- Basic notions
- Historical development: two competing approaches
- Types of corpus
- Exploiting a corpus
- Resources
Basic notions
Corpus:
- A collection of naturally occurring language text, chosen to characterise a state or variety of language (Sinclair)
- A collection of linguistic data, either written text or a transcription of recorded data, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language (Dictionary of Linguistics and Phonetics)
What is a corpus?
- A large body of evidence, typically composed of attested language use (McEnery)
- Usually a corpus is in machine-readable format and is ideally viewable and analysable through a single software package
- The word corpus comes from the Latin for "body"; the plural is corpora
“If it happens once, you don't know anything. If it happens twice, it suggests further investigation. If it happens three or more times, then you have something to write about!”
History
- The history of the field splits into two periods: before Chomsky and after Chomsky
- Before Chomsky, methods similar to those of corpus linguistics were already in use (empiricism)
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus1/1fra1.htm
Early corpus linguistics
- Before Chomsky, computers were not available, so it was difficult to analyse large collections of text
- Studies of child language using diaries kept by parents
- Spelling conventions in a German corpus of 11 million words
- Foreign language pedagogy
Early corpus linguistics (II)
All the work of early corpus linguistics was underpinned by two fundamental, yet flawed, assumptions:
- The sentences of a natural language are finite.
- The sentences of a natural language can be collected and enumerated.
Most linguists saw the corpus as the only source of linguistic evidence in the formation of linguistic theories
Chomsky
Between 1957 and 1965 Chomsky changed the direction of linguistics from empiricism towards rationalism
“Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list” (Chomsky, 1962)
Introspection started to be used instead
Problems with introspection
Naturally occurring data is observable and verifiable by everyone. Introspective data is artificial. Human beings have only the vaguest notion of the frequency of a construct or a word.
The revival of corpus linguistics
- Research in corpus linguistics continued in small centres
- Hardware still imposed restrictions; the real development started in the 1980s
- Fields such as computational linguistics were not interested in using corpora
Fillmore’s description of the two approaches
The corpus linguist: “He has all the primary facts that he needs, in the form of a corpus of approximately one zillion running words, and he sees his job as that of deriving secondary facts from his primary facts. At the moment, he is busy determining the relative frequencies of the eleven parts of speech as the first word of a sentence versus the second word of a sentence.”
The “armchair” (introspective) linguist: “He sits in a deep soft armchair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes, sits up abruptly shouting, ‘Wow, what a neat fact!’, grabs his pencil, and writes something down… having come still no closer to knowing what language is really like.”
Goals of corpus linguistics
Chomskyan linguistics | Corpus linguistics
‘Langue’ (competence) | ‘Parole’ (performance)
Ideal speaker/hearer | Complexity/variation
Language = innate mental faculty | Language = social phenomenon
Intuitive evidence | Empirical evidence
Universals | Differences
Grammar | Meaning
Types of corpora
- Written vs Spoken
- General vs Specialised, e.g. ESP, Learner corpora
- Monolingual vs Multilingual, e.g. Parallel, Comparable
- Synchronic vs Diachronic; Monitor
- Annotated vs Unannotated
Written corpora
 | Brown | LOB
Time of compilation | 1960s | 1970s
Compiled at | Brown University (US) | Lancaster, Oslo, Bergen
Language variety | Written American English | Written British English
Size | 1 million words (500 texts of 2000 words each) | 1 million words (500 texts of 2000 words each)
Design (both corpora) | Balanced corpora; 15 genres of text, incl. press reportage, editorials, reviews, religion, government documents, reports, biographies, scientific writing, fiction
Specialised corpora
 | CSPAE | CHILDES
Time of compilation | 1990s | Since 1980s
Compiled at / by | Michael Barlow (Rice Univ) | Project started at Carnegie Mellon Univ; contributors worldwide
Language variety | Spoken professional American English | 20 languages, incl. E. Asian, Germanic, Romance, Slavic…; mainly conversational data
Size | 2 million words (tagged) | c. 20 million words (growing)
Design | Transcripts from professional settings (meetings, conferences…) by 400 speakers; academia (1 M wds), politics (1 M wds) | “Child language data exchange system”, offering transcripts of monolingual and bilingual children’s language (language acquisition data)
CATEGORY | CORPUS | COMPILED AT | LANGUAGE | SIZE | DESIGN
First generation major corpora | Brown Corpus (1960s) | Brown Univ, US | Written American English | 1 million (tagged) | 15 genres of text: press reportage, religion, fiction…
Second generation mega corpora | Bank of English (since 1991) | COBUILD, Birmingham Univ | Written / spoken English | 450 million as of 2002 (tagged) | Monitor corpus; mostly written: newspapers, books; spoken: conversations, broadcasts, interviews...
Second generation mega corpora | International Corpus of English [ICE-GB] (1990s) | UCL, London | Written / spoken British English | 1 million (grammatically parsed) | One of 15 projects worldwide preparing different national / regional varieties of English; 200 written, 300 spoken texts, various genres
Specialised corpora | Corpus of Spoken Professional American English [CSPAE] (1990s) | Rice Univ, US | Spoken American English | 2 million (tagged) | Transcripts from professional settings (meetings, press conferences) by approximately 400 speakers, centred on activities tied to academia and politics
Learner corpora | International Corpus of Learner English [ICLE] (since 1990s) | Louvain Centre for English Corpus Linguistics, Belgium | English writing by learners from 19 mother-tongue backgrounds, incl. Chinese | Over 2 million | Essay writing by advanced learners of English as a foreign language
Non-English monolingual corpora | HK Cantonese Adult Corpus [HKCAC] (2000) | Dept of Speech & Hearing Sciences, HKU | HK Cantonese | 170,000 characters | Spontaneous speech recorded from phone-in radio programs and forums, by 69 speakers
Multilingual / Parallel corpora | International Telecommunications Corpus [ITU / CRATER] (1995) | CRATER project (Corpus Resources & Terminology Extraction), Lancaster Univ | French, English and Spanish | 1 million tokens in each language (tagged) | Trilingual parallel corpus from the telecommunications domain; aligned at sentence level
Other examples of available corpora
Ways to exploit a corpus
- Word (token) / type frequency lists
- N-grams
- Concordances
- Collocations / colligations
- Specially designed programs (especially when the corpus is annotated)
Frequency lists
- lists that show the words which appear in a corpus together with their frequency
- they provide a survey of the corpus
- a frequency list becomes more meaningful when compared with other lists
- they remove a word from its contexts
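As a rough illustration of the points above, a frequency list can be built in a few lines of Python. This is only a sketch: the sample sentence is invented for the example, and the tokenisation rule (lowercase alphabetic strings, with an optional apostrophe part) is an assumption that a real project would replace with a proper tokeniser.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

# Invented sample text, not taken from any real corpus.
sample = ("On the stroke of midnight the gong sounded, "
          "and on the stroke of noon it sounded again.")

tokens = tokenize(sample)
freq = frequency_list(tokens)

# Tokens are running words; types are distinct word forms.
print(len(tokens), "tokens,", len(set(tokens)), "types")
for word, count in freq[:5]:
    print(word, count)
```

Note that the output lists words without any context, which is exactly the limitation mentioned in the last point.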
N-grams
- groups of N words which appear in sequence in the text
- they are presented using frequency lists
- a good way to identify recurring or corpus-specific expressions
- they provide limited context for the words
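A minimal sketch of n-gram extraction, assuming the text has already been split into tokens. The token sequence is invented for illustration; the function name `ngrams` is just a label chosen here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented token sequence, not taken from any real corpus.
tokens = "on the stroke of midnight and on the stroke of noon".split()

# Count the trigrams and list them as a frequency list.
trigram_counts = Counter(ngrams(tokens, 3))
for gram, count in trigram_counts.most_common(3):
    print(" ".join(gram), count)
```

Recurring expressions surface as the n-grams with a count above one.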
Concordances
- show words in the contexts in which they appear
- they are usually obtained with special programs that allow the concordance lists to be manipulated
- KWIC (Key Word In Context) is the most common format
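The KWIC format can be sketched as follows. This is a simplified illustration, not how MonoConc or any other concordancer is implemented: the context width is counted in tokens (an assumption; many tools count characters), and the sample sentence is invented.

```python
def kwic(tokens, keyword, width=4):
    """Return one (left, keyword, right) triple per occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def format_kwic(lines):
    """Right-align the left context so the keywords line up in one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in lines]

# Invented sample text, not taken from any real corpus.
tokens = ("On the stroke of midnight the gong sounded and "
          "on the stroke of noon it sounded again").split()

hits = kwic(tokens, "stroke", width=3)
formatted = format_kwic(hits)
for line in formatted:
    print(line)
```

Sorting the triples by their right or left context before formatting gives the kind of ordered display that concordance programs typically offer.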
Example of concordance output (from MonoConc)
Langue - Parole
     famous boots. On the stroke of full time the Stoke
          the lead on the stroke of half-time with a goal
  Smith sin-binned on the stroke of half-time, added a
clinched their win on the stroke of lunch after resuming
chase by declaring on the stroke of lunch. <p> With a lead
  expectant crowd, on the stroke of midday. The bird
  hour began not upon the stroke of midnight but upon the
 of midnight but upon the stroke of noon. There was,
booked in advance. On the stroke of seven, a gong summons
          Promptly on the stroke of six 'clock, the chooks
    from Edinburgh on the stroke of the Millennium.
Parole: syntagmatic
Langue: paradigmatic
Collocations
- collocation = the occurrence of two or more words within a short space of each other in a text
- the collocates are extracted using a window to the left and right of a specified word
- can be used to further analyse the context of a word
http://www.sketchengine.co.uk/sampler/
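The window-based extraction of collocates can be sketched as below. This is a minimal illustration that only counts raw co-occurrences; it does not compute any association score, and it is not how Sketch Engine works internally. The window size, the function name `collocates` and the sample tokens are all assumptions made for the example.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens to the left or right of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented token sequence, not taken from any real corpus.
tokens = ("on the stroke of midnight the gong sounded and "
          "on the stroke of noon it sounded again").split()

stroke_collocates = collocates(tokens, "stroke", window=2)
for word, count in stroke_collocates.most_common():
    print(word, count)
```

A wider window captures looser associations at the cost of more noise, so the window size is usually chosen with the research question in mind.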
What can we do with a corpus? Two broad approaches
- Corpus-based approaches: hypotheses are checked against a corpus
- Corpus-driven approaches: hypotheses are drawn from the corpus
1. Posit a hypothesis: -le is a separate morpheme for the concept of future.
2. Concordance: find all occurrences of “le” in the word forms of the corpus.
3. Test the hypothesis
4. Test a new hypothesis
Distribution of subject personal pronouns across registers
[Figure: frequency per million words (axis 0 to 140,000) by register (Conv, Fict, News, Acad); series: they, we, it, he/she, you, I]
Comparison of individual modal and semi-modal verbs in conversation and academic prose (based on LGSWE, Table 6.6, p. 489)
[Figure: frequency per million words (axis 0 to 6,000); series: Conversation, Academic prose]
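Both charts report frequency per million words, which is what makes counts from corpora or registers of different sizes comparable. A minimal sketch of that normalisation, with invented counts and corpus sizes used purely for illustration:

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size

# Invented figures: the same raw count in two sub-corpora of different sizes.
conversation = per_million(150, 500_000)
academic = per_million(150, 2_000_000)

print(conversation, academic)
```

The same raw count of 150 yields very different normalised frequencies, which is why raw counts should not be compared directly across registers.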
Fields where corpora are used
- Lexicography (designing dictionaries)
- Language studies (relations between languages, differences between genres, evolution of the language)
- Computational linguistics (training and testing methods)
- Language teaching (learner corpora)
- Cultural studies, psycholinguistics
Web as a corpus
- The Web can be a very useful source of texts
- The Web is very helpful for languages other than English
- Quite often there is no control over the language being investigated, so filtering (where possible) is necessary
Existing corpora
- Brown Corpus / LOB corpus
- Bank of English
- Wall Street Journal, Penn Treebank, BNC, ANC, ICE, WBE, Reuters Corpus
- Canadian Hansard: English-French parallel corpus
- York-Helsinki Parsed Corpus of Old English Poetry
- Tiger corpus: German
- CORIS/CODIS: contemporary written Italian
- MULTEXT: 1984 and The Republic in many languages
References
- Karin Aijmer and Bengt Altenberg (1991) English corpus linguistics, Longman
- Douglas Biber, Susan Conrad and Randi Reppen (1998) Corpus linguistics, Cambridge University Press
- Graeme D. Kennedy (1998) An introduction to corpus linguistics, Longman
- Tony McEnery and Andrew Wilson (1996) Corpus linguistics, Edinburgh University Press
Online resources
- Corpus Linguistics Online: http://www.corpus4u.com/
- Corpus Linguistics and English Education and Teaching: http://sfs.scnu.edu.cn/corpus4u/
- ConCapp: http://panda.nhce.edu.cn/corpus4u/tools/concord/wconcord.rar
- BNC: http://sara.natcorp.ox.ac.uk/lookup.html
- CLEC: http://www.clal.org.cn/corpus/ChiSearchEngine.aspx
- CHILDES: http://childes.psy.cmu.edu/