This article originally appeared in the Nov/Dec 1993 issue of Language Industry Monitor.

Application- and domain-independence are essential elements in Unisys’s NLP strategy as its Paramax researchers ready their technology for the market.

For the past ten years, Unisys has had a natural language research program at the Valley Forge Laboratories Research and Development facility of Paramax Systems Corporation (a Unisys subsidiary) in eastern Pennsylvania (USA). Here, a team of seven researchers is focusing on three application areas: natural language interfaces (including spoken language), data extraction from free text, and understanding of the information contained in hardcopy documents. With Paramax’s basic NLP technology reaching maturity, the next step, says Deborah Dahl, manager of Symbolic Processing and Analysis, is to enhance the basic capabilities of the group’s technology and build it into real applications to learn what demands those applications will make of it.

The core of Paramax’s NLP system is PUNDIT (Prolog Understanding of Integrated Text). Dahl explains that the group’s goal was to build a general-purpose system that could be easily adapted to a wide range of applications. PUNDIT is both domain-independent and application-independent. “This generality is a unique feature of PUNDIT,” says Dahl. “While many NLP systems are designed for use as natural language interfaces, PUNDIT can also form the basis for such applications as text processing, information retrieval, and grammar checking.”

Airspeak

Portability from one domain to another is an important issue; Dahl suggests that it is not so much the difficulty of porting a natural language application as the expense of doing so that is hindering the wider use of NLP. Developing techniques to speed up the process of porting PUNDIT to new applications is therefore a high priority for the Paramax researchers.
PUNDIT has formed the basis for several spoken language understanding projects, including an air traffic control system. Other prototypes have included an air travel database and an expert system offering city directions, both within the framework of the DARPA SISTO program. Paramax has also participated in the DARPA Message Understanding Conference since its inception in 1987.

Paramax does not do primary research in speech recognition; rather, it uses speech recognition technology developed elsewhere. It is currently using the SPHINX system developed at Carnegie Mellon University as a front end to its spoken language systems.

In IDUS, Paramax’s Intelligent Document Understanding project, several processing technologies, including image understanding, optical character recognition (OCR), document structural analysis and text understanding, work cooperatively to extract data from hardcopy documents. “Although scanning and OCR technology are critical components of this goal, they are not sufficient by themselves,” Dahl explains. “OCR, for example, alone cannot distinguish the title and author of an article from its body text, yet this information is indispensable for intelligent retrieval of articles. This is the kind of information that a system like IDUS will try to extract.”

Dahl hopes that IDUS will evolve into a complete document understanding system, supporting such applications as information retrieval, reconstruction, transformations between representations, editing, routing, summarization and combinations of these. The current IDUS prototype is built into a system that generates data for a text retrieval application.

The economics of NLP

However, Dahl is highly optimistic about the Linguistic Data Consortium (LDC), recently established at the University of Pennsylvania under the direction of Mark Liberman.
The LDC, which has been given an initial two-year grant of five million dollars by DARPA, should become a major repository of lexical data; its offerings will include raw text, the Penn Treebank corpus of parsed text (useful for testing and developing grammars), and spoken language data. Much of the material comes from DARPA itself, collected in earlier DARPA projects. The members of the LDC will share the costs of acquiring and maintaining these linguistic resources, which LDC proponents consider “pre-competitive.” Should Paramax join the consortium as a commercial partner, the LDC could prove an affordable and viable source for the linguistic resources that Dahl and colleagues will need to develop their technology further for commercial purposes. As Dahl puts it, “it’s hard to build and test comprehensive NLP systems without access to a lot of data.”

Paramax Corp., Valley Forge Laboratories, 70 E. Swedesford Road, Paoli, PA 19301, USA