A survey of some existing corpora and a classification
Some important existing English corpora
- The Brown corpus and the Lancaster-Oslo/Bergen corpus (LOB): Two well-known corpora from the beginnings of the computer age are the Brown corpus of written American English and the Lancaster-Oslo/Bergen corpus of written British English. The Brown corpus was compiled in the 1960s, its British counterpart in the 1970s. Both consist of around one million tokens (i.e. running words: each word is counted every time it appears).
- The London-Lund corpus is another corpus of British English created around that time, but it differs from the Brown and the LOB in that it contains only transcripts of spoken material, collected at the Survey of English Usage at University College London. The London-Lund corpus, the Brown corpus, the LOB and other corpora are now available on CD-ROM as the ICAME collection of English texts. The International Computer Archive of Modern and Medieval English (ICAME), situated at Bergen in Norway, offers a wealth of information on these corpora.
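The notion of a token used above can be made concrete with a small sketch. It counts the tokens (running words) and the types (distinct word forms) in a short sample. The tokenisation rule and the sample sentence are assumptions made purely for illustration; real corpus projects use far more careful rules.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


# Hypothetical sample text, not taken from any of the corpora named above.
sample = "The cat sat on the mat. The dog sat too."

tokens = tokenize(sample)   # every occurrence counts
types = Counter(tokens)     # one entry per distinct word form

print("tokens:", len(tokens))
print("types:", len(types))
print("most frequent:", types.most_common(2))
```

A corpus size such as "one million tokens" refers to the first figure, not the second: a frequent word like "the" contributes one token for each of its occurrences.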
Nowadays you will find modern corpora, which differ from those named
above. In the first place, thanks to technological advancements, in particular
faster and more powerful computers, the size of modern corpora is vastly
greater. The British National Corpus, for example, consists of around 100 million
words, i.e. it is a hundred times larger than the Brown corpus! Also, corpus
designers today usually try to include as much spoken material as is financially
and technically feasible. (Remember that creating transcripts of conversations
is a time-consuming and expensive process!) Three examples of modern corpora
are the British National Corpus (just mentioned), the International Corpus of English and the Bank
of English, based at the University of Birmingham:
- The Bank of English was initiated in 1991 by COBUILD (a division of HarperCollins publishers) and the University of Birmingham. Its main purpose has always been to provide a textual database for the compilation of dictionaries and for language studies. The Bank of English is a monitor corpus (i.e. new material is constantly added). The corpus now contains more than 320 million words.
- The British National Corpus was compiled by a consortium of British publishers and academic institutions such as Oxford University Computing Services and Lancaster University's Centre for Computer Research on the English Language, together with the British Library. It is a 100 million word corpus of modern British English, both written and spoken, including everyday conversations. It is available on CD-ROM for research purposes; we have a copy at our department.
- The International Corpus of English (ICE) will ultimately be a collection of one-million-word corpora, one from each country or region where English is spoken as a first or official additional language. Each corpus consists of a written and a spoken component. The Survey of English Usage, situated at University College London, is responsible for this project. The home page of the Survey provides information on a variety of research projects, including the International Corpus of English (ICE).
More links
- The CHILDES system (Antwerp mirror of the American site): This is the home page for the Child Language Data Exchange System (CHILDES). In particular, you'll find the CHILDES database, a collection of child language transcript data from a number of projects in different languages (including English and German).
- The STELLA project at Glasgow: Here you will get information on COMET (COMputerized English Texts) and access to the project's collection of Scottish texts.
Online corpora
- Experimental BNC Website: The British National Corpus consortium currently offers a BNC online service which allows anyone with access to the internet to register for an account on the BNC server (free for twenty days of unlimited use).
- Shakespeare Online Corpus
- Concordance browsing: This site allows you to search a number of English literary classics, including the Brontë novels, Shakespeare and James Joyce's Ulysses, with the help of the concordance program TactWeb. It is easy to use, even for absolute novices in the area.
This list of corpora is only a rather subjective selection. For a more
exhaustive list of corpora and other online resources, go to "English
language corpora and corpus resources".
A possible classification
- medium: spoken corpora (e.g. London-Lund corpus) vs. written corpora (e.g. Lancaster-Oslo/Bergen corpus (LOB)) vs. mixed corpora (e.g. British National Corpus (BNC) or Bank of English)
- national varieties: British corpora (e.g. Lancaster-Oslo/Bergen corpus) vs. American corpora (e.g. Brown corpus) vs. international corpora (e.g. the International Corpus of English)
- historical variation: diachronic corpora (Helsinki corpus, cf. the ICAME home page) vs. synchronic corpora (Brown, LOB, BNC) vs. corpora which cover only one stage of language history (corpus of Old or Middle English, Shakespeare corpora)
- geographical variation/dialectal variation: corpora of dialect samples (e.g. Scots) vs. mixed corpora (e.g. the spoken component of the BNC, which includes speakers from all over Britain)
- age: corpora of adult English vs. corpora of child English (e.g. the English components of CHILDES)
- genre: corpora of literary texts vs. corpora of technical English vs. corpora of non-fiction (e.g. news texts) vs. mixed corpora covering all genres
- open-endedness: closed, unalterable corpora (e.g. LOB, Brown) vs. monitor corpora (Bank of English)
- availability: commercial vs. non-commercial research corpora, online corpora vs. corpora on ftp servers vs. corpora available on floppy disks or CD-ROMs
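One way to work with a classification like this is to record each corpus as a small record and filter on its attributes. The following is only a sketch: the class `Corpus`, its field names and the helper `select` are hypothetical, and the values are limited to what this survey states (`None` marks anything not stated here).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Corpus:
    name: str
    medium: str                  # "spoken", "written" or "mixed"
    monitor: Optional[bool]      # True = monitor corpus, False = closed
    approx_words: Optional[int]  # rough size as given in this survey


# Values taken from the survey text; None means "not stated there".
CORPORA = [
    Corpus("Brown", "written", False, 1_000_000),
    Corpus("LOB", "written", False, 1_000_000),
    Corpus("London-Lund", "spoken", None, None),
    Corpus("BNC", "mixed", None, 100_000_000),
    Corpus("Bank of English", "mixed", True, 320_000_000),
]


def select(corpora, **criteria):
    """Return the corpora whose attributes equal all the given criteria."""
    return [
        c for c in corpora
        if all(getattr(c, key) == value for key, value in criteria.items())
    ]


written = [c.name for c in select(CORPORA, medium="written")]
monitors = [c.name for c in select(CORPORA, monitor=True)]
print(written)
print(monitors)
```

Further dimensions from the list above (national variety, genre, availability) could be added as extra fields in the same way.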