GSI-ERLI's Hard-working Dictionaries

By Andrew Joscelyne

This article originally appeared in the Jul/Aug 1991 issue of Language Industry Monitor.

February 1990 saw the initiation of an ambitious in-house dictionary project called BENEDICTE at Paris systems house GSI-ERLI. Director Bernard Normier told Language Industry Monitor more about the project and explained how it fits in with his company’s overall commercial strategy.

With $4.8 million annual sales in NLP, GSI-ERLI is unequivocally the leading natural language processing company in France. Over the past fourteen years, ERLI has accumulated vast experience in this field by supplying natlang interfaces for both private and public databases, indexing and query tools for full-text bases, and a range of multilingual applications in French, Italian and English.

In recent years, ERLI has become increasingly interested in converting the lexical resources developed for these various applications into a format accessible to its language analysis software ALETH as well as to other generic dictionary models, in particular Genelex, the dictionary project set up at the end of last year under the auspices of the Eureka program, in which ERLI is a key partner (see sidebar). A year and a half into Benedicte, ERLI can now begin to evaluate the preliminary encoding of dictionary entries.

ERLI sees its job as delivering systems that work under industrial constraints. Normier believes that the crucial variable in an industrially robust dictionary is not the actual number of entries, but the quality of the morphological, syntactic, and semantic description for each entry. The strength of ERLI’s methodology, says Normier, lies in the fact that its dictionaries are constantly confronted with real applications built to specific client requirements.
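To make the three levels of description concrete, here is a minimal sketch of what a single entry might look like. It is purely illustrative: the class name, field names and sample values are assumptions made for this example, not the Benedicte entry model, which the article does not describe in detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LexicalEntry:
    """Hypothetical entry layout; every field name here is an assumption."""

    lemma: str
    part_of_speech: str
    # Morphological description: the inflected forms of the lemma.
    inflected_forms: List[str] = field(default_factory=list)
    # Syntactic description: e.g. subcategorization frames.
    frames: List[str] = field(default_factory=list)
    # Semantic description: one coarse semantic class, if assigned.
    semantic_class: Optional[str] = None

    def description_levels(self) -> int:
        """Count how many of the three description levels are filled in."""
        return sum([
            bool(self.inflected_forms),
            bool(self.frames),
            self.semantic_class is not None,
        ])
```

Under these assumptions, an entry for "restaurant" with its plural form and a semantic class, but no syntactic frames, would report two of the three levels as filled. A count of this kind is one crude way to track description quality per entry rather than the raw number of entries.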
For Benedicte, ERLI is drawing on a set of dictionaries it developed for its clients’ applications and on various lexicographical resources derived from agreements with publishers such as Hachette and Collins, whose Cobuild stock will form the basis for the extension of ERLI’s dictionary into English. ERLI is particularly interested, for example, in developing natural language interfaces that allow access to databases where queries might be written in one language while the database consists of text in another.

The Benedicte project is essentially open-ended, with no ceiling on the number of entries planned. The data is regularly being mapped onto the dictionary entry model that the database management software will handle. As the project advances, allowances are being made for improving the entry classifications where necessary.

Semantic coding of the entries is still at a relatively primitive stage of development. ERLI’s research in this area is focused on defining the range of major semantic classes that can be used for short-term purposes, rather than on the construction of semi-exhaustive feature sets. Particular attention is being paid to derivation and synonymy relations, which prove highly useful in full-text information retrieval searches as well as in potential computer-assisted translation applications.

While Benedicte represents a major investment in standardizing dictionary entries for repeated use in customized applications, ERLI is also working at the level of grammar analyzers. Here, the aim is to produce parsers as generic as the dictionaries. With its GRAAL (Grammars for Reuse in the Automatic Analysis of Language) project, ERLI will be developing a hard core of reusable grammars available for a range of NLP applications (CAT, mono- and multilingual indexing, database querying, etc.) drawing on Benedicte-type dictionary material.
Once again, the goal is adapting NLP technology to rigorous industrial constraints, where processing time, cost, and system ergonomics are key parameters.

While these projects gradually build up momentum, ERLI is also receiving useful feedback from a recent (and apparently highly successful) NLP application: a language processing module for the vast online version of the French Yellow Pages. Accessible through Minitel (the French public videotex system), this Yellow Pages is one of the largest databases in the world queried by non-specialist users. In March, a pilot version of the new interface was brought online through Centre Language Naturel (CLN) in Rennes, Brittany. By year’s end, the new interface will be extended to the rest of the country.

Where previously a keyword search device would inevitably trawl a large number of irrelevant catches from the database, the new CLN, embodying a robust parser and enriched phrase dictionaries, can reduce request noise by orders of magnitude, despite the fact that there is almost no pre-control over the kind of language that users can key in at their little beige terminals.

GSI-ERLI, 1 place de Marseille, Charenton le Pont CEDEX, 94220, France; +33 1 48 93 8121, Fax +33 1 43 75 79 79

COPYRIGHT © 1991 BY LANGUAGE INDUSTRY MONITOR