Longman Dictionaries Feed the NLP Community
By Andrew Joscelyne

This article originally appeared in the Jan/Feb 1992 issue of Language Industry Monitor

As the original publisher of Samuel Johnson’s Dictionary, Longman is the oldest British company in the business of managing and marketing language reference material. Della Summers, the company’s dictionary and corpus suprema, explains why Longman Dictionaries prefers to serve the language researcher rather than publish electronic books.

According to Della Summers, Director of Longman Dictionaries, “the first stage in electronic dictionary publishing today is uncertainty.” Like other holders of large-scale language reference material, Longman has been tempted by the digitization of its dictionaries, without as yet marketing a product for the general public. Since financing any path-breaking work in language technology necessitates chalking up profit somewhere else, Summers’ first step is analyzing the risks involved in going digital.

At present, it seems, owning the intellectual property of the word hoards does not generate enough revenue to justify any large-scale ventures. “In such market segments as handheld language products,” Summers explains, “the order of precedence for returns is first the distributor, then the hard- and software developers, and, last, the owner of the intellectual property.” In other words, if Longman goes into electronic publishing seriously, it will have to manage the whole range of operations from wiring in the software to marketing the product, rather than simply licensing its property to the computer industry.

It was certainly tempted to put at least one of its venerable paper-based language products into a handheld device. Summers says, “I still think that handheld products provide a more attractive publishing medium than PC-based products.” But the company decided that the combination of a new market and a new network for distribution was not worth the risk.
“The market,” she adds, “works differently in electronic mode.”

Despite its circumspection, Longman is watching the developing trend in popular electronic language products with interest. Della Summers views the current fashion for Data Discman-type products, as presented with some flourish at last November’s Frankfurt book fair by companies such as Sony, Sanyo, and Panasonic, as the beginning of the “promised explosion” in electronic publishing, but she is not convinced that such products are really electronic books. “As a publisher, I would love to give these electronic people a lesson in design. Their products seem to me to be deeply user unfriendly.”

LDOCE

If Longman doesn’t intend for the moment to follow certain of its colleagues into the standalone CD or PC market for word lists, it is making significant advances in the groundwork for the actual process of dictionary production. Increasingly, this means building large corpora from which to extract information for a series of steadily more customized end products. In fact, the company’s flagship product is not a corpus of raw material in the obvious sense of the word but a dictionary which can be treated as a linguistic database for a range of further processing activities.

The Longman Dictionary of Contemporary English (LDOCE, usually pronounced “eldos”) emerged in 1978 from a general English Language Teaching dictionary publishing program started four years before. “Everything begins with LDOCE,” says Summers. “It is by far the most used product we have.”

There are two major reasons for the success of LDOCE as a resource for NLP research. First, the original design specifications include the use of a restricted vocabulary of two thousand words for all definitions; indeed, one of the first tasks that could be automated once the dictionary was on-line was checking that there were no deviations from this Defining Vocabulary (DV) in the text.
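The kind of automated DV check described above can be pictured with a minimal sketch. It is an illustration only, not Longman's actual software: the function name `find_dv_deviations` and the tiny `toy_dv` word list are invented here, and the sketch matches exact word forms, whereas a real checker would also have to handle inflected forms.

```python
import re


def find_dv_deviations(definition, defining_vocabulary):
    """Return the words of a definition that are not in the defining vocabulary.

    Words are lower-cased and reported once each, in order of first appearance.
    Matching is on exact forms only (no handling of inflections).
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", definition.lower())
    seen = set()
    deviations = []
    for word in words:
        if word not in defining_vocabulary and word not in seen:
            seen.add(word)
            deviations.append(word)
    return deviations


# Toy vocabulary invented for this example; the real DV has about two thousand words.
toy_dv = {"a", "an", "person", "who", "flies", "flying", "machine"}

print(find_dv_deviations("a person who flies an aircraft", toy_dv))
# prints ['aircraft']
```

Run over every definition in the dictionary, a check of this kind would list each out-of-vocabulary word for an editor to review.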
A researcher at University College Dublin, Cheng-ming Guo, has recently brought out a semantically processed version of this DV called Mini LDOCE. Operating in both Prolog and Lisp environments, it provides a system for translating the DV into a set of semantic primitives. The Mini LDOCE will also reduce a non-DV word used in some definitions to a DV-type translation (e.g., aircraft, needed in the definition of pilot, will receive a [flying machine...] reduction).

The second reason is the inclusion of grammatical and semantic information about words in the form of simple codes annotating each definition. When in 1979 the computer tape of the LDOCE was made available to language researchers, the key application was converting these codes into various grammatical formalisms, such as the Generalized Phrase Structure Grammar model developed for NLP research purposes in the Alvey project at Cambridge University Computer Laboratory. Longman has been supplying LDOCE tapes to a wide range of university and commercial research arms, covering everything from machine translation (especially for Japanese teams) and expert systems to neural network research.

As for the semantic information, a new set of codes has been developed for the LDOCE. These include restrictions on verb subjects, adjectives, and first and second verb objects; semantic features such as degree, function, manner, and intention; and sociolinguistic features, which identify register, intention, and user group. Although there have been no official announcements, there are strong suggestions that the Longman team has a project up its sleeve to bring out a product based on power searches over these semantic codes.

Better dictionaries

This project will not necessarily generate a software product, however. Della Summers is quick to emphasize that the constant target for Longman Dictionaries products is a user who is usually not a native speaker of English, a member of that ever-growing community of language learners.
“The success of LDOCE as a computer tool is largely a result of the explicit information it gives to learners about the language. Since computers don’t speak English either, it proves to be an ideal information resource for NLP applications. But machine-readability is not the be-all and end-all; our objective is to create dictionaries which get continually better defined.” This means using corpus information and computer-based lexicographic systems to compile increasingly effective or tailor-made paper dictionaries.

Longman has taken seriously the corpus approach to providing itself with the right raw material for its dictionary making, even if not all corpus specialists agree with the company’s design specifications. Longman would claim, in fact, that as the publisher of Dr. Johnson’s famous Dictionary of the English Language, it has been providing an authoritative quote from a real text as an example of word usage for nearly two-and-a-half centuries.

Longman has gradually set up a battery of corpora to serve its various requirements. One of them is based on a textbase of current periodicals to provide data for the Longman Register of New Words. Another, more substantial corpus is the Longman/Lancaster English Language Corpus (LLELC), based on a variety of contemporary texts and approaching the thirty million word mark in size. One “selective” part covers fifteen million words of predetermined ratios of text chunks from ten “superfields” and includes a wide range of document features, such as author, occasion, genre, etc. The other, “microcosmic” part attempts to provide a control corpus; it contains fifteen million words selected from texts listed in Books in Print by using random number tables.

The purpose of the LLELC is to extend the kind of service Longman has been providing with LDOCE, namely supplying research organizations in the fields of language research, NLP, and pedagogical studies with a testbed and information resource.
To this end, a corpus extraction system has been developed allowing potential customers to specify features sought so that Longman can generate a spoken corpus.

Spoken corpora

As corpus fever reaches critical mass in the UK, Longman has undertaken yet another corpus project, the British National Corpus (BNC). Longman will be joining the dictionary publishers Oxford and Chambers, Oxford and Lancaster Universities, and the British Library in a hundred-million-word project funded by the Department of Trade and Industry as part of a large-scale UK infotech enhancement program. The BNC is to be delivered in 1994.

Longman’s specific contribution to the BNC is a new corpus of which the team, led by Dictionary Resources and Systems Manager Stephen Crowdy, is extremely proud: the Spoken Corpus. Abandoning previous attempts to build a corpus of the spoken language by using the written form of sermons or lectures, Longman decided to record up to ten million words of “real conversational speech” by affixing unobtrusive Walkman-type recording devices to subjects and retrieving the tapes for eventual transcription via the keyboard into the database, warts and all.

Longman recently completed a 400,000-word pilot corpus by asking fifteen people to record for a week every syllable from their first “good morning” to their final snore. Della Summers says this approach to data collection is “the biggest turn-on for lexicographers since the invention of the pencil.” Together with other data derived from broadcasts and business meetings, it will allow potentially interesting comparisons with written language patterns and will provide information on frequencies for researchers into dialogue management systems, etc. If material is continuously added to the base over time, diachronic patterns of speech usage might also emerge.
Although it means work in hard times, the keyboard operators entering all this raw speech would no doubt welcome the arrival of high-quality speech recognition technology. Instead of transcribing the speech to text for concordancing and other operations, a fully digitized spoken corpus would enable a whole range of other NLP and linguistic searches to be carried out, pushing corpus building into a new dimension.

Longman Dictionaries
Longman House, Burnt Mill, Harlow, CM20 2JE, UK
Tel +44 279 426 721, Fax +44 279 431 059

COPYRIGHT © 1992 BY LANGUAGE INDUSTRY MONITOR