Océ’s robust parser


This article originally appeared in the May-June 1992 issue of Language Industry Monitor

What is a company that is best known for high-speed copiers doing developing NLP software? Read on.

“Océ has a large research department,” says Lou Cremers, as he guides his guest from the reception area of Océ R&D through a maze of halls and buildings towards his workplace. “There are nine hundred people working here.” Océ, he explains, was founded in the last century by a chemist who developed a coloring agent for butter. Currently, the company, which employs twelve thousand people worldwide, is firmly established at the high-volume end of the copier and laser printer market. Corporate and manufacturing headquarters are in the town of Venlo, in the south of the Netherlands, just a few kilometers from the German border.
    Cremers and colleague Rob Heemels form the Language and Speech Technology Group within Océ’s software department. Two substantial projects of theirs will soon be seeing the light of day and they are ready to talk about them. The first of these is the Natural Language Processing Package; the second is the Intelligent Full Text Retrieval System.

Not a tree but a forest
The NLP Package, explain Cremers and Heemels, is a set of tools for developing NLP applications. It includes an input normalizer, a lexicalizer which checks entries in a lexicon, a table generator which generates a parse tree, and a parser. The parser, explains Cremers, is based on an adaptation of the Tomita algorithm and is fast and robust. Because of a feature unification mechanism, it can detect and correct grammatical errors. Switching on a graphic representation of the parse of a test sentence, Cremers demonstrates how the modified Tomita parser detects all possible interpretations of a sample input text and produces not a tree but literally a forest of all possible interrelations between elements, displayed graphically in a window on his workstation.
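How a unification mechanism can both detect and repair an agreement error can be sketched in a few lines of Python. This is only an illustration, not Océ's implementation: the toy lexicon, the flat feature dictionaries, and the names `unify` and `check_agreement` are all assumptions made for the example, and no Tomita-style parsing is attempted.

```python
# Toy lexicon (hypothetical): word -> (category, agreement features).
LEXICON = {
    "copier":  ("noun", {"num": "sg", "per": 3}),
    "copiers": ("noun", {"num": "pl", "per": 3}),
    "prints":  ("verb", {"num": "sg", "per": 3}),
    "print":   ("verb", {"num": "pl"}),
}


def unify(a, b):
    """Merge two flat feature sets; return None if any shared feature clashes."""
    result = dict(a)
    for key, value in b.items():
        if key in result and result[key] != value:
            return None  # clash, e.g. num=pl against num=sg
        result[key] = value
    return result


def check_agreement(subject, verb):
    """Return (ok, suggestion) for a subject-verb pair from the toy lexicon."""
    subj_agr = LEXICON[subject][1]
    verb_agr = LEXICON[verb][1]
    if unify(subj_agr, verb_agr) is not None:
        return (True, None)
    # Detection succeeded; now look for a verb form that does unify.
    for candidate, (category, agr) in LEXICON.items():
        if category == "verb" and unify(subj_agr, agr) is not None:
            return (False, candidate)
    return (False, None)


print(check_agreement("copier", "prints"))
print(check_agreement("copiers", "prints"))
```

The point of the sketch is that the same operation serves both purposes: a failed unification signals the error, and the first form that unifies is offered as the correction.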
    “We’ve taken a decidedly bottom-up approach,” Heemels goes on to say. They have dedicated themselves to developing efficient and robust solutions to specific problems regarding grammar and syntax and not tried to tackle complex problems of “world knowledge” representation. He has his reservations about what he calls the “top-down” approach, espoused in projects such as Eurotra, Philips’ Rosetta, and BSO’s Esperanto-based DLT, which have hitherto failed to successfully puncture the semantic barrier. Cremers and Heemels acknowledge, however, that NLP is at an impasse; these large, ambitious projects have not fulfilled their promise, while the modest tools which are available are cut from a paltry fabric indeed.
    Océ’s NL group evolved out of an Esprit project, dating from 1986, which involved the development of a prototype for an intelligent workstation. Océ’s contribution was a grammar checker and a dialog system for Dutch and English. They envisaged a kind of linguistic spreadsheet, in which changing one word would cause changes in dependencies to ripple through your text. “It was just a prototype, though,” says Heemels. What does this imply? “Well, it wasn’t modular, it was all hard-coded in Lisp. You had to be a programmer to adapt it, and that is unacceptable. To make a true commercial package, we started all over again.”
    In developing the parser, Heemels and Cremers have taken a good look at what is currently on the market in PC grammar checking software, such as the ever-popular Grammatik IV and RightWriter packages. They were not impressed, dismissing them as glorified pattern-matching programs. Heemels acknowledges that many hundreds of thousands of copies of these programs have been sold, implying that they address a need, but he questions whether people really use them. “You can’t develop an effective grammar checker without an accurate parse of a sentence,” asserts Heemels. “A pattern-matching system can track certain things, like verb agreement, up to a certain point, but once interrelated elements are separated by a subordinate clause, for example, it will inevitably stumble.”
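Heemels’ objection can be made concrete with a small Python sketch. It is a caricature, not a description of how Grammatik IV or RightWriter actually work: the word lists are invented, `adjacent_pattern_check` stands in for a naive pattern matcher that compares each verb with the nearest preceding noun, and the subject-verb links that a real parser would produce are supplied by hand.

```python
# Toy number lexicons (hypothetical).
NOUNS = {"copier": "sg", "copiers": "pl", "technician": "sg", "technicians": "pl"}
VERBS = {"jams": "sg", "jam": "pl", "repairs": "sg", "repair": "pl"}


def adjacent_pattern_check(tokens):
    """Flag each verb whose number differs from the nearest preceding noun."""
    flagged = []
    last_noun_number = None
    for token in tokens:
        if token in NOUNS:
            last_noun_number = NOUNS[token]
        elif token in VERBS and last_noun_number is not None:
            if VERBS[token] != last_noun_number:
                flagged.append(token)
    return flagged


def structural_check(subject_verb_pairs):
    """Flag verbs that disagree with the subject a parse assigned to them."""
    return [verb for subject, verb in subject_verb_pairs
            if NOUNS[subject] != VERBS[verb]]


# A correct sentence: the subject of "jams" is "copier", not "technicians".
good = "the copier that the technicians repair jams".split()
good_links = [("technicians", "repair"), ("copier", "jams")]

# An incorrect sentence: "copiers ... jams" disagrees across the clause.
bad = "the copiers that the technician repairs jams".split()
bad_links = [("technician", "repairs"), ("copiers", "jams")]

print(adjacent_pattern_check(good), structural_check(good_links))
print(adjacent_pattern_check(bad), structural_check(bad_links))
```

With the subordinate clause in the way, the nearest-noun heuristic raises a false alarm on the correct sentence and misses the genuine error in the incorrect one, while the check based on the parse gets both right.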
    Heemels and Cremers say the modules of the NLP package have been used to develop a number of different prototype applications, such as a grammar checker, an intelligent search and replace, intelligent OCR, a translation system, and a natural language interface for a database. They would like to make it available soon to outside developers, but it is not yet exactly clear how. In the meantime, a researcher from the nearby University of Nijmegen - “one of those unique people who is both an expert linguist and a computer scientist” - is currently fleshing out their Dutch grammar.

Full-text retrieval in any language
In contrast to the Natural Language Processing Package, the Intelligent Full Text Retrieval System is explicitly intended as an end-user product. Currently, it runs on Unix workstations; DOS and Windows versions are in preparation. According to Heemels and Cremers, the system uses linguistic knowledge to make it “smarter.” So-called “function” words (articles, prepositions, etc.) can be excluded from the full-text indexing process to increase efficiency. By supplying a synonym file, which can be either a general purpose thesaurus or a user-defined dictionary, the system automatically links related words, whereby a given keyword will return all synonyms in a query. In addition, a morphology analyzer can link all inflectional forms of a word and alternate spellings of a word can be mapped to one standard entry. All of these facilities help reduce the size of indexes and improve the accuracy of the retrieval engine. Recall precision is significantly enhanced by the Concept facility, which allows users to create detailed and weighted queries. Other special features include a concordance option for viewing keywords in context and a facility for generating indexes.
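The indexing facilities described above can be approximated in a short Python sketch. Everything in it is an assumption made for illustration: the function word list, the lookup table `NORMAL_FORMS` that stands in for a morphology analyzer and spelling normalizer, the tiny `SYNONYMS` file, and the function names. It shows the general idea, not the Océ system.

```python
from collections import defaultdict

# Hypothetical function word list, excluded from the index.
FUNCTION_WORDS = {"the", "a", "of", "in", "and", "is"}

# Stand-in for a morphology analyzer: inflected forms and alternate
# spellings are mapped to one standard entry.
NORMAL_FORMS = {
    "copiers": "copier",
    "colour": "color",
    "colours": "color",
    "colors": "color",
}

# Hypothetical synonym file linking related words.
SYNONYMS = {"copier": {"photocopier"}, "photocopier": {"copier"}}


def normalize(word):
    """Lower-case, strip punctuation, and map to the standard entry."""
    word = word.lower().strip(".,;:!?\"'")
    return NORMAL_FORMS.get(word, word)


def build_index(documents):
    """Build an inverted index: standard entry -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for raw in text.split():
            term = normalize(raw)
            if term and term not in FUNCTION_WORDS:
                index[term].add(doc_id)
    return index


def search(index, keyword):
    """Return ids of documents containing the keyword or a synonym of it."""
    term = normalize(keyword)
    hits = set()
    for t in {term} | SYNONYMS.get(term, set()):
        hits |= index.get(t, set())
    return sorted(hits)


documents = {
    1: "The copiers in the hall",
    2: "A photocopier and a printer",
    3: "Colour of the toner",
}
index = build_index(documents)
print(search(index, "copier"))
print(search(index, "colors"))
```

Because every inflected form and spelling variant collapses to a single entry and function words are never stored, the index stays small, and a query on one form still reaches documents that use another.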
    Like similar full-text retrieval packages, the Océ system ranks the documents it retrieves on the basis of criteria such as the number of occurrences of a keyword. Océ’s system is unique, however, in that it is fully language independent and will automatically recognize the language of a given text, opening the very interesting possibility of multilingual textbases. Languages are recognized by means of statistical information about each language, with the function word list specific to each language compiled from the dictionaries and triggered automatically. This language independence is extendable, as the retrieval package supports a user-defined character set.
    An important concern for a company like Océ is protecting its technology; it is a partly defensive tactic. As Océ’s manager of research Klaas Kuin puts it, “you don’t want someone like Gilbert Hyatt [the holder of an until recently obscure microprocessor patent] coming along twenty years from now upsetting the apple cart.” Kuin alludes to the huge damage payments which Minolta and Kodak recently have had to make because of transgressions of competitors’ proprietary technology. Océ has already obtained a patent in the United States for its parsing technology, and is taking other steps to protect its intellectual property.
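A rough Python sketch of the two ideas in the paragraph above, ranking by keyword occurrences and recognizing a text’s language, follows. The article does not say which statistics Océ uses, so counting hits against per-language function word lists is an assumption chosen for simplicity, as are the word lists and the names `guess_language` and `rank`.

```python
# Hypothetical function word lists, one per language.
FUNCTION_WORDS = {
    "english": {"the", "of", "and", "is", "in", "to"},
    "dutch": {"de", "het", "een", "van", "en", "is", "in"},
    "german": {"der", "die", "das", "und", "ist", "von", "in"},
}


def guess_language(text):
    """Pick the language whose function words occur most often in the text."""
    tokens = text.lower().split()
    scores = {
        language: sum(token in words for token in tokens)
        for language, words in FUNCTION_WORDS.items()
    }
    return max(scores, key=scores.get)


def rank(documents, keyword):
    """Order matching documents by number of keyword occurrences, highest first."""
    keyword = keyword.lower()
    counts = {
        doc_id: text.lower().split().count(keyword)
        for doc_id, text in documents.items()
    }
    matching = [doc_id for doc_id, count in counts.items() if count > 0]
    return sorted(matching, key=lambda doc_id: (-counts[doc_id], doc_id))


print(guess_language("de kopieermachine van het bedrijf is snel"))
print(guess_language("the parser is fast and robust"))

documents = {
    "a": "parser parser grammar",
    "b": "parser lexicon",
    "c": "copier",
}
print(rank(documents, "parser"))
```

Once the language has been guessed, the matching function word list could be the one handed to the indexer, which is one plausible reading of the list being “triggered automatically.”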

A pixel company
Explaining why a company renowned for its office equipment is so involved in software, Kuin says, “we already sell a lot of software - only it’s packaged in copiers. Easily half of the value of one of our top-of-the-line copiers is the software. We regard ourselves fundamentally as a pixel company.” As in other mature high-tech industries, the profit margins in this business are in the software and the service side of things, more than in the actual hardware. “Why can’t we have a copier that can correct the spelling of a document?” Kuin asks rhetorically. “Or one that changes the typeface of a text from Times to Univers?” That might be speculation, he says, but that is his job: to look five to ten years ahead. “In the high technology business, ups and downs are inevitable. You can count on having something of a dip every seven years or so. The only way to survive is to plan for it and think beyond your current successes. You have to realize that your position in the market will inevitably change.”
    The boundaries between photocopying, printing, and document storage are becoming less distinct and it is important for Océ to keep on top of that. Kuin says he once visited the offices of Blue Cross in Chicago where they make on the order of forty million photocopies per month. That is a situation which is obviously ripe for some form of electronic document processing. While that might mean fewer copies being made, and therefore less business for a company like Océ, it also presents new opportunities for supplying new, integrated, as yet unspecified document systems. “It’s essential to know how your products are used,” says Kuin. “Not only do we need to know about the photoconductors at the heart of our machines, but it is essential that we also know how our customers use our copiers.” As good as a given technology might be, he suggests, it is irrelevant if it is not well adapted to its surroundings; this can be extended to NLP software as well, where the user-interface of a system will make it or break it.
    Winding our way back to the main entrance, we pass labs where groups of researchers cluster around half-disassembled copiers, like surgeons in an operating theater, delivering the next generation of office machines (“that’s top secret in there,” says Cremers). “A photocopier is as complex as an airplane,” he observes. “Chemistry, electronics, mechanical engineering, software, user ergonomy - all these disciplines are integrated into one machine.”

Océ Nederland BV, Postbus 101, 5900 MA Venlo, The Netherlands. Tel +31 77 593444, Fax +31 77 594313

COPYRIGHT © 1992 BY LANGUAGE INDUSTRY MONITOR
