What is a corpus? What is corpus linguistics?

Some definitions

corpus, plural corpora: A collection of linguistic data, either compiled as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language - for example, to determine how the usage of a particular sound, word, or syntactic construction varies. Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.
(cf. Crystal, David. 1992. An Encyclopedic Dictionary of Language and Languages. Oxford, 85)

CORPUS (13c: from Latin corpus body. The plural is usually corpora) (1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse. (2) Plural also corpuses. In linguistics and lexicography, a body of texts, utterances or other specimens considered more or less representative of a language, and usually stored as an electronic database. Currently, computer corpora may store many millions of running words, whose features can be analysed by means of tagging (the addition of identifying and classifying tags to words and other formations) and the use of concordancing programs. Corpus linguistics studies data in any such corpus.
(cf. McArthur, Tom. "Corpus", in: McArthur, Tom (ed.) 1992. The Oxford Companion to the English Language. Oxford, 265-266)
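
To make the idea of a concordancing program concrete, here is a minimal sketch of a keyword-in-context (KWIC) listing in Python. It is only an illustration under simple assumptions: the sample text is invented for this example, the tokenizer is deliberately naive, and the function names tokenize and concordance are hypothetical, not taken from any particular concordancing tool. Tagging is not shown.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Toy "corpus": invented sentences, used for illustration only.
sample = ("The corpus is large. A corpus can be tagged. "
          "Every corpus that I examined taught me something.")

tokens = tokenize(sample)
lines = concordance(tokens, "corpus", width=2)

# Print each hit with the keyword aligned in a central column.
for left, kw, right in lines:
    print(f"{left:>15} [{kw}] {right}")
```

Each output line shows one occurrence of the search word with a few words of context on either side, which is the basic display a concordancer offers. Real programs work on corpora of millions of running words and usually add sorting, tag-based searching, and frequency counts.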


Advantages and drawbacks of using corpora in natural language research - two comments

Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [of language based on the corpus] would be no more than a mere list. (Chomsky, Noam. 1962. In: Hill, Archibald A. (ed.) Third Texas Conference on Problems of Linguistic Analysis in English. Austin, 159)


I have two main observations to make. The first is that I don't think there can be any corpora, however large, that contain information about all of the areas of English lexicon and grammar that I want to explore; all that I have seen are inadequate. The second observation is that every corpus that I've had a chance to examine, however small, has taught me facts that I couldn't imagine finding out about in any other way. (Fillmore, Charles J. 1992. "Corpus linguistics" or "Computer-aided armchair linguistics", in: Svartvik, Jan. (ed.) Directions in Corpus Linguistics. Berlin/New York, 35)