Computational Linguists Going Green

From our correspondent


This article originally appeared in the Jun/Jul 1991 issue of Language Industry Monitor

At the 5th Conference of the European Chapter of the Association for Computational Linguistics, held in Berlin in mid-April this year, the cream of European compulinguists turned its collective mind to the ecology-driven question of how to reuse the endless stream of grammars and lexicons produced by generations of graduate students.

Over the years, custom has dictated that doctoral candidates in computational linguistics write a set of rules, usually built around the syntactic permutations of the John/Mary/love scenario but more and more often plunging a bit deeper into the semantics of its states and actions. The exponential growth in such rule-sets (usually much modified over time) is beginning to seem wasteful; in an eco-conscious unified Germany, it was clearly the moment for European ACL’ers to think about how to recycle more than trashed Trabants.
    A key term emerging from discussion was the notion of “theory-neutral” NLP dictionaries, something Antonio Zampolli and colleagues from Pisa have been proposing for years. Nancy Ide (Vassar College, USA) reported on the activities of an international group working to define a dictionary structure standard for public domain dictionaries, for both print and computational use. This group opts for a sort of “lowest common denominator” dictionary structure, marked up with SGML tags (a standard whose time has clearly come). Other participants, particularly those from the Netherlands, prefer the opposing strategy of choosing the highest common denominator for attribute fields. They feel that dictionary-makers should simply be able to add everything they might know about a given word to the entry structure. Subsets of this maximalist information can then be siphoned off to flesh out other dictionaries.
    Underlying this search for dictionary entry denominators is not so much theoretical debate as hard cash. The expense is of particular concern to CEC technocrats. Having nine languages (so far) to engineer into some form of computational equality, they dislike the idea of paying for the same lexicalizing process over and over again. The title of the paper given by Nino Varile (a major figure in the CEC’s Luxembourg-based MT and langtech initiatives) is eloquent: “The Time is Over for Do-It-Yourself Natural Language Processing.” What the industry should be aiming for is less continual re-keying of those same thirty-thousand general terms, and more off-the-shelf dictionaries and grammars based upon current proto-standard conventions.

WHERE PUBLISHERS COME IN
An emerging industry, however, comes up against an age-old, jealous craft. Traditional dictionary publishers in Europe loathe the idea that anyone might reuse even a single entry without paying for it, byte for byte. But Oxford University Press’s Sue Atkins pointed out that the time might be ripe for publishers to hire themselves out as electronic dictionary makers to the language industry. Already in March this year, a group of major Euro-dictionary publishers, including Longmans, Hachette and Duden, held a mini-summit in Paris where collaborative initiatives in dictionary format were discussed. In the UK, the British National Consortium (University of Lancaster, The British Library, OUP and Longmans) is looking at the possibility of “pre-competitive” generic dictionaries, although this project might not necessarily be geared to the needs of computational linguists.
    Resource reuse could be applied to texts as well as dictionaries (and why not grammar rule-sets?). A large body of machine-readable texts is useful for a range of compulinguistic investigations, from checking on word use (collocations, idiom formats, etc.) to exploring the incidence of gapping constructions or complement types. You can also use them to test parsers -- or your fragment of an MT system.
    In the USA, a machine-readable text bank built up from donations from publishers (among them bits of the Wall Street Journal) is already available for the research community, but, as Donald Walker reported, everyone else is doing it too, with the British text collection initiative leading the way. One might also add the use of a collection of Agence France Presse news stories as an MT text-bed in France. Again, each national corpus must be marked up in some reasonably consistent way so that everyone has access to each other’s raw material. If SGML is an obvious candidate for document structure/content coding, other crucial decisions must be made: how should texts be sampled? Are text words to be tagged with grammatical category? Should simple parses of parts of the corpus be carried out, and, if so, how? Should text source and status be recorded?
    One last thought for the practically minded: with text collections running into millions of words, who’s going to pay for the giant disk drive I need to store them on? No answers in Berlin this year.

COPYRIGHT © 1991 BY LANGUAGE INDUSTRY MONITOR
