This article originally appeared in the Sept-Oct 1994 issue of Language Industry Monitor.

At the forefront of linguistic engineering in France, and beyond?

French, not English, is the operational language of the world’s busiest high-end NLP application. Every day, some half a million people across France turn to their familiar beige Minitel terminals to consult the online Pages Jaunes, the French Yellow Pages. While the rudimentary Videotex interface may suggest a simple keyword look-up, the system, unbeknown to its legions of users, in fact boasts a robust natural language interface which integrates computational linguistics, software engineering, and a mountain of data. At the heart of the system is a semantic network of some forty thousand terms which represents the world of the French Yellow Pages.

This unique system was launched in the Brittany region of France in 1990; by 1993 it was online throughout France, a reflection of the system’s robustness and flexibility. As is readily apparent when you try it yourself, the interface does not rely on simple keyword searching but performs some sophisticated linguistic analysis, still a rarity, surprisingly enough, in commercial information retrieval applications. While other parts of the Minitel system may enjoy a slightly seedy reputation, the Yellow Pages reasoning is comme il faut. “Je veux une dame” (“I want a lady”) deposits us decorously in the listings of marriage agencies.

The developer of the natural language interface to France’s on-line Yellow Pages is GSI-Erli, a French firm which has built up extensive expertise in the field of language technology, including natural language interface design, AI, restricted language, full-text indexing, information retrieval, and even machine translation.
This potent combination of computational linguistic techniques and software engineering savvy, acquired through the development of a series of industrial-strength applications, gives substantial credence to GSI-Erli’s claim to being Europe’s pre-eminent linguistic engineering firm. The only comparable organization which immediately comes to mind is Eurolang, just four stops away from GSI-Erli on the St Remy-Les-Chevreuse metro line, but as GSI-Erli’s Bernard Normier points out, although the two companies may be similar in size, they are vastly different in orientation. While it may be moving slowly down-market, GSI-Erli is still largely project-driven; Eurolang, in contrast, is concentrating all its efforts on off-the-shelf commercial applications.

Both, however, are part of larger organizations. GSI-Erli is a subsidiary of GSI (Generale de Service Informatique), France’s fourth largest software house, with 1993 sales of FFR 2.5 billion. GSI-Erli was founded in 1977 by Bernard Normier, its current director, and boasted sales of nearly FFR 30 million for 1993. Its staff currently numbers fifty-five. Reflecting its stake in the online Yellow Pages, France Telecom has taken a thirty percent interest in GSI-Erli.

Mirroring the real world, France’s Yellow Pages continues to evolve, and that means regularly adapting and refining both the query interface and the underlying knowledge base. GSI-Erli has deployed a monitoring tool, called Oscar, which collects and analyzes users’ queries and enables GSI-Erli to continually fine-tune the input analyzer. For updating the knowledge base, the company also supplied a tool, AlethGraf, which graphically represents the forty thousand semantic entries in the Yellow Pages domain hierarchy on a Unix workstation.
The visual rendering of such complex structures has become something of a specialty for GSI-Erli; the company also deployed a similar application for INSEE, the French National Bureau of Statistics, to help users find the correct entry in very large nomenclatures of activities and products.

GSI-Erli has since developed an online Yellow Pages for the Italian PTT and a prototype for US-West. The company claims it took only a couple of months to port the natural language interface component for the prototype, using the Aleth toolbox (see sidebar).

Another notable GSI-Erli project is a letter generation system for the French mail order company La Redoute. With two million customers, the company was faced with the daunting task of handling three thousand pieces of correspondence a day. La Redoute was not happy with the stylistic results of a previous template-based system. Because they couldn’t remember the large number of boilerplate texts, users tended to stick with a small subset of texts, with less than optimal results. Dissatisfied, La Redoute turned to GSI-Erli for a more sophisticated solution.

The system which GSI-Erli built for La Redoute generates correspondence based on a symbolic representation of the problem. In operation, a user reads an incoming letter, then enters the essential facts and a basic course of action into the system. Based on a formalized description of the situation, the system automatically generates a letter which is grammatically, rhetorically, and stylistically correct. GSI-Erli says that it is the first system of its kind in Europe. Long-time NLP watchers will know that in the US, Reader’s Digest commissioned a similar system from Roger Schank’s company Cognitive Systems some years back.

For several customers which produce large quantities of technical documentation, most notably Aerospatiale, GSI-Erli has also developed a translation workstation, AlethTrad, which it customizes for specific applications.
Normier says a thorough analysis of a small sample source text corpus can supply a prospective customer with a good indication of the potential productivity gains which can be achieved with AlethTrad, particularly in combination with restrictions imposed on the source text.

As befits a commercial software house, these custom applications are undertaken on a project basis for specific customers. But thanks to its strong ties with the research world, GSI-Erli is also a frequent recipient of French and European government funding; it is a frequent, and desirable, partner for academic-industrial consortia, currently much in favor among funding agencies. Such activities, says Normier, represent “a compromise between academic and industrial ideas.”

GENELEX, which wound up in 1994, is probably the most prestigious project in which the company has been involved. Launched in 1990, this Eureka (nationally funded) project, bankrolled to the tune of ECU 23 million, had as its ambitious goal the development of generic computational dictionaries for French, Italian, and Portuguese. The motivation behind the project is a now familiar litany. On the one hand, developing application-oriented dictionaries from scratch is highly labor- and cost-intensive. On the other hand, re-using dictionaries created in idiosyncratic formats can be enormously difficult. Hence the raison d’etre for GENELEX. On the basis of a generic dictionary model, the consortium has created a lexical database (in SGML and object-oriented form) from which application dictionaries can be extracted. This, says GSI-Erli, enables dictionary maintenance to be centralized for additional economies of scale. In addition to lexical data, the GENELEX consortium has also developed an object-oriented scheme for displaying the linguistic data graphically and a front-end to help lexicographers structure lexical data more easily, thereby ensuring consistency.
GSI-Erli’s work in GENELEX formed the basis for AlethDic, GSI-Erli’s French dictionary. GSI-Erli’s partners also produced Italian and Portuguese dictionaries, although these are less complete than the French one; it has also completed a kernel for an English dictionary. As is customary in “cost-shared” projects, GSI-Erli owns the copyright on the materials it developed within GENELEX. According to Normier, the GENELEX consortium has released the GENELEX specification to the public domain, and GSI-Erli is currently exploring the possibility of licensing the data which it has developed in this project to third parties.

GSI-Erli is actively involved in the EAGLES initiative to develop guidelines and proto-standards for language engineering, and as the host organization of the dictionary working group it is proposing that the GENELEX model serve as the basis for the dictionary recommendations.

Another important ongoing Eureka project in which GSI-Erli is participating is GRAAL (Grammars Re-usable to Automatically Analyze Languages), with funding of more than ECU 20 million. The large GRAAL consortium (eleven partners) hopes to accomplish for grammars what GENELEX did for dictionaries, namely constructing an application-independent foundation for writing them. Using GRAAL tools, the consortium plans to implement a series of pilot applications for several of its industrial partners (Aerospatiale, Lingsoft, and Nokia) for such tasks as automatic indexing, knowledge extraction, and computer-aided translation.

More recently, GSI-Erli has embarked upon TransTerm, a project in the second round of the CEC’s LRE program. Here, the goal is to develop standard methods for enriching terminology databases with linguistic information for the purpose of integrating them within NLP dictionaries. There is an abundance of terminological data in the world today, but very little of it is in a form which can be used, for example, by an MT system.
Despite high-profile industrial projects like the French Yellow Pages and prestigious R&D efforts like GENELEX and GRAAL in his company’s portfolio, Bernard Normier is reluctant to be overly optimistic. “The era of these big isolated projects is behind us,” states Normier unequivocally. “There simply aren’t that many companies which can fund the development of large-scale systems.”

As Normier sees it, the challenge now facing GSI-Erli is moving from big custom-built systems, built virtually from scratch, towards generic systems which can be customized for specific domains. Perceiving the software industry as a pyramid, he sees a small number of very big customers at the top; the further down the pyramid one goes, the smaller but more numerous the customers are. Normier feels GSI-Erli has exhausted the small but lucrative top tier. For it to survive, says Normier, it must move down the pyramid. Is the company succeeding? Well, says Normier, five or ten years ago it had just a small handful of customers, but now it counts fifty. He hopes that someday its customer base will be several hundred.

Ultimately, this boils down to reusability, probably the key issue in linguistic engineering today, for as Normier says, “to be price competitive, you must reuse.” In the course of many projects, GSI-Erli has expended extensive effort developing tools and resources. However, as he points out, “the logic of the customer doesn’t always correspond to the logic of internal needs.” Putting it another way, there is a natural conflict between project-oriented goals and the goals of reusability. As Normier explains, “not surprisingly, our customers don’t want to pay for the basic language capabilities. They feel that that is application-independent.
Naturally, they only want to pay for the linguistic coverage specific to their application.” The challenge, therefore, is a balancing act in which the needs of given projects are attended to while contributions are made to the company’s basic linguistic technology. And that means ensuring a steady stream of suitable projects.

Some welcome relief from this predicament comes from government funding, from both national and EU funding agencies, and according to Normier projects like GENELEX have had a considerable impact on GSI-Erli. “Previously, we were almost entirely project-driven, that is to say, we built each of these systems largely in isolation from each other. Each project created its own dictionary, grammars, etc. There was virtually no reuse!”

GSI-Erli is still project-oriented, but Normier says the company is now structured by application area. It has teams dedicated to translation technology, documentation (indexing), and technical writing (encompassing restricted language applications). In addition, it has a team dedicated to language resources, which form the basis for all the applications.

In dealing with linguistic software, it is imperative to understand exactly how the software works, cautions Normier. “A black box philosophy (it works because it works) is simply unacceptable.”

For Normier, the situation in language engineering today (latent demand paired with slow adoption) is reminiscent of a similar situation several decades ago. Normier: “In the early 1970s, relational database technology was regarded with a great deal of scepticism. People said: ‘it will never work.’ At the time, there was no common view of what the relational model was, no shared terminology, no methodologies for implementing it, and, not surprisingly, no products.” He points out that much the same can be said about linguistic engineering today.
“The relational database was an expensive technology when it first appeared, but as it became more widely used, it also became more affordable. I expect the same to happen here as well.” While the analogy may be imperfect, Normier nonetheless believes that “a common view of what language engineering is” is finally emerging; he even sees this happening over the next few years.

With France’s surprisingly vital language engineering industry, together with a highly active contingent in Quebec, the French language places a close second to English, easily beating German, in the depth and breadth of language processing applications created for it. Normier plays down the impact of Francophone chauvinism on this state of affairs, although he acknowledges that the passing of a recent “linguistic purity” law by the French minister of culture may stimulate interest in technological “solutions.” Rather, he points out that the substantial support France gives to linguistic research supplies fertile soil for subsequent industrial exploitation. Says Brian Oakley, former director of the British Alvey program, “the French take their language, and therefore language engineering, very seriously.” As GSI-Erli demonstrates. (See sidebar.)

GSI-Erli, 1, Place des Marseillais, F-94227 Charenton-le-Pont, France; Tel: +33 1 48 93 81 21, Fax: +33 1 43 75 79 79

COPYRIGHT © 1994 BY LANGUAGE INDUSTRY MONITOR