Cambridge Language Survey: new multilingual corpora


This article originally appeared in the Sep-Oct 1992 issue of Language Industry Monitor

By Andrew Joscelyne

Cambridge University Press is launching the Cambridge Language Survey (CLS), a large-scale, long-term project to collect English, French, German, Spanish, Italian, Dutch, and Japanese text and to develop software for its analysis. Currently in its initial canvassing period, the Survey aims to bring together publishers, academics, and high-technology industries in a number of countries to form an international consortium that can then seek funding from governments and the CEC. The overall objective is to go beyond the current monolingual corpus approach and form a vast “information resource” for language. This would then be made available to anyone needing access to extensive, non-partisan, multilingual linguistic data.
    According to project coordinator Paul Proctor, former editor of the popular Longman Dictionary of Contemporary English and now senior editor for “International Dictionaries” at Cambridge, “what is unusual in CLS is that for once lexicographers and linguists will work closely together. In the past, their interests have usually kept them apart.” He explains that CLS has in part grown out of the success of the recently completed ESPRIT-funded multilingual ACQUILEX 1 project, which linked European publishers and NLP teams (including the neighboring Cambridge Computer Laboratory) in an effort to draw up guidelines for the computerized acquisition of lexical data.
    One of the main objectives of CLS is to create electronic dictionaries for subsequent use as tools for the analysis of each language. The emphasis will be on the semantic dimension of words, which entails developing a more extensive linguistic coding set than is currently available. “We shall concentrate on collocations and the statistical analysis of the company words keep,” says Proctor. “This means building up very large databases.”
    Ecumenical and eclectic, the CLS is currently seeking additional participants, with the hope of obtaining the widest possible coverage for data acquisition. Rather than competing with any of the existing language resource collection projects, CLS will work in concert with them, extending their findings and avoiding duplication of effort.
    As for the ambitions of his own organization, Paul Proctor suggests that the Press has one or two specific publishing ideas up its sleeve that ought to benefit from the data CLS gathers, a new learner’s dictionary of English being one of them.

Cambridge University Press, Edinburgh Building, Shaftesbury Rd, Cambridge CB2 2RU, UK; Tel +44 223 325 880, Fax +44 223 315 052, Email psp10@phx.cam.ac.uk

COPYRIGHT © 1992 BY LANGUAGE INDUSTRY MONITOR
