The bigger the better?
Corpora (co-) builder John Sinclair

This article originally appeared in the Nov-Dec 1991 issue of Language Industry Monitor

By Andrew Joscelyne

The advent of massive text corpora, comprising many tens of millions of words, is having a profound effect on the language industry. John Sinclair, Editor-in-Chief of the innovative Cobuild Dictionaries and pioneer in the use of text corpora in lexicography in the UK, recently explained to the Monitor what the excitement is all about.

Using a corpus of texts in language processing is nothing new. Even the smallest sampling of the utterances (speech acts, sentences, texts, documents, etc.) that comprise the models for language engineering applications is drawn from some concrete usage and must be called a corpus, however arbitrary and highly constrained it might be. Statistical analyses of large machine-readable corpora, however, can provide a far richer source of raw linguistic data for building a dictionary or developing a parser. Text corpora can also be useful test-beds for evaluating new applications.

John Sinclair: “Attitudes to large text corpora have grown far more positive recently. These corpora are even beginning to change our whole attitude to language. The availability of very large bodies of text and the possibility of recording, transcribing, and rendering machine-readable spoken language are having a profound impact on the actual study of language.

“Back in the mid-1980s, the first UK Alvey Project on information technology specifically stated that it was not interested in corpora. The Americans only started to take the notion of very large machine-readable corpora seriously two years ago. Unfortunately, most linguists do not accept that, for describing language, their intuitions are perfectly genuine but provide terribly partial evidence of language behavior.

“But now, decisions being taken in the European Commission and elsewhere suggest that the use of large corpora is finally being taken seriously. Even so, you find highly corpus-minded people saying ‘If I find a conflict between the evidence and my intuition about usage, I’ll trust my intuition!’ Corpora are acknowledged to be useful tools but are predominantly used to justify pre-corpus models of language. I believe we should adopt a new attitude of humility towards the evidence that the corpora are throwing at us: we do not really know what language is like, so let us study the corpora and see what they teach us. But it will take us at least another decade to develop an adequate methodology for exploiting their riches.”

“When we began Cobuild, a dictionary based on a databank of 7m words of English and designed for the international community of English speakers, we didn’t assume that it would change our idea of language. We thought that using a corpus would simply provide documentary evidence to fill some of the blanks in our knowledge of word usage. We now realize that linguists have been starved of this kind of data. The situation was rather like that of early astronomers who worked without instruments. Any old theory can get you going but new information may require you to radically rethink your position.

“In the traditional study of semantics, the meaning of words is usually analyzed independently of what I call the textual environment. Yet given the kinds of phenomena we are finding, this now seems like a quite bizarre method of studying language. Because language patterning is almost infinitely diverse, you can confirm virtually any reasonable theory. You need to reach the stage where the corpus suggests the theoretical model to you rather than simply using the corpus to verify an intuition.

“In determining ‘standard’ usage, frequency is obviously an important parameter of plausibility.
When producing a dictionary, you find that it is particularly difficult to describe a word if you have fewer than ten or so occurrences, since clear context patterns do not emerge. At Cobuild, we realized that the 7m-word corpus was insufficient for accurate statistical analysis and upped it to 20m words. Our online Bank of English now contains 100m words of very diverse spoken and written English. A shift from 20m to 100m words, for example, allowed some interesting patterns of complementation to emerge for certain classes of English verbs. However, there are grounds for thinking that only with a 200m-word corpus will the full range of grammatical patterns emerge.

“Take that class of English adjectives, including veritable and nice, which is regularly preceded by the indefinite article. Nice is probably one of the most common adjectives in the language, yet has some very odd constraints. It is regularly preceded by the indefinite article, it is regularly modified by very, how and rather, and it regularly modifies other adjectives. Or look at the set of necessarily modified nouns (e.g., He’s a good husband but *he’s a husband). These examples all need to be analyzed in relation to other classes to discover the patterns. If we spend enough time analyzing what we have, a corpus-guided approach should provide us with a more accurate description of the language than any cognitive modelling approach can offer.

“If you use a corpus, you must not undervalue anything you find in it. In other words, if you choose a corpus of quality newspapers for your base, then you must accept what it says, not what you might think it should say. This is particularly important in the case of the new corpora of spoken language. My prediction is that people will use them for the right reasons but they will reject half of what they find there. The corpus will be used simply to validate an a priori view of the language.

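The kind of check described above, whether a word such as nice is regularly preceded by the indefinite article or by very, how and rather, can be sketched as a simple left-context count. This is only a minimal illustration and not Cobuild's actual method: the tokenizer, the function names (`tokenize`, `left_contexts`, `enough_evidence`) and the sample sentences are all assumptions made up for this example, and the threshold of ten simply mirrors the "ten or so occurrences" rule of thumb mentioned in the interview.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def left_contexts(tokens, target):
    """Count the words that immediately precede each occurrence of `target`."""
    return Counter(
        tokens[i - 1] for i in range(1, len(tokens)) if tokens[i] == target
    )


def enough_evidence(tokens, word, minimum=10):
    """Hypothetical helper: True if `word` occurs at least `minimum` times."""
    return tokens.count(word) >= minimum


# Toy data invented for illustration; not drawn from any real corpus.
sample = (
    "It was a nice day. She had a nice idea. That is very nice. "
    "How nice of you. He is a nice man and rather nice to talk to."
)

tokens = tokenize(sample)
contexts = left_contexts(tokens, "nice")

# Show which words precede "nice", most frequent first.
print(contexts.most_common())

# With only six occurrences, this sample falls below the ten-occurrence threshold.
print(enough_evidence(tokens, "nice"))
```

On a sample this small the counts mean little, which is the point made in the interview: patterns of this kind only become reliable once the corpus supplies enough occurrences of each word.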
“At Cobuild, for example, we strive to offer a model of international English, so we throw in a range of books, newspapers, and spoken language, from the US and Australia as well as the UK, excluding children’s speech and poetry. In the original statistics, the word ‘nymphet’ turned up fifty times, due to the inclusion of Nabokov’s Lolita. This is a skewed picture of current English. But because we committed ourselves to this approach, we had to accept the consequences.”

“If you use anything other than complete texts, your corpus will also be skewed in other, less interesting ways. With the Genghis Khan approach, where you try to capture ‘averages’ by chopping selections of twenty pages from the beginning, middle and end of various types of documents, you are inevitably betraying language structure. Language simply doesn’t come in even-sized samples; it comes in documents, so that’s how you need to use it. Once again, it would be surprising to discover that position in a document was NOT a significant feature for language patterning. In fact, we can demonstrate what a text is about at any given point on the basis of inter-collocation frequencies.

“To a large degree, the question of how many words you have in your corpus is becoming ridiculous. A finite corpus, such as we have all been developing, would make no sense in the long run, other than as an archive. Since vast amounts of text are now generated in machine-readable form, it is far more interesting to use this as a steady flow of new data, pouring it over a range of pre-set filters which extract relevant information and throwing the text away. We call this the monitor corpus approach. The question is: what should be retrieved from this endless flow of language as it streams through the processing unit?

“With a monitor corpus, we see that if a word is continually used in a certain environment then it develops a ‘normality of occurrences’ which can be checked by frequency analysis.
By monitoring the stability of these norms, we can add a diachronic (historical) dimension to the study of language patterning, something which is impossible with a sampling-type finite corpus.

“In general, I see the corpus as a potential source of information about a language, with dictionaries and grammars as our access tools. A dictionary should no longer be the stable authority it is when a language is first being written down. Rather, it should be the analysis of a given corpus. Once a dictionary has been developed on the basis of evidence from the corpus, it should provide all the information you need to understand a text in that language.

“There is an interesting spinoff at this stage: if you can understand the dictionary, you can understand the language. We are developing a parser that can understand our Cobuild dictionary (which is intentionally written in English sentences rather than in lexicographer-ese). It is easier to parse a dictionary than a language. And once we’ve parsed the dictionary, we should be able to parse, and therefore understand, the language. The success of the parser will depend on the adequacy of the dictionary, so the parser will be an excellent measure of the overall adequacy of the dictionary.”

COPYRIGHT © 1991 BY LANGUAGE INDUSTRY MONITOR