Cognitech's Linguistic Software for Publishers


This article originally appeared in the Nov/Dec 1991 issue of Language Industry Monitor.

A group led by Nijmegen's Gerard Kempen produces practical tools for processing multilingual texts.

Cognitech (not to be confused with the U.S. and French companies of the same name), a group of researchers associated with the University of Nijmegen who specialize in linguistic software, has developed what it says is “the first grammar-based spelling checker” for Dutch. Called CORRie, the system was written in C by Cognitech's Theo Voss and runs on UNIX and VMS platforms. Designed for processing large volumes of text, it operates in batch mode, marking errors and entering suggestions in the right margin, which the user can later accept or disregard.
    CORRie has been running since last summer on the central VAX cluster at Nijmegen's Computer Center. Students and employees who have accounts on the system can send text files by e-mail to the spellchecker for overnight processing and receive the corrected text back the next day. The first commercial licensee of CORRie is Dutch publisher Samsom-Sijthoff, in Alphen aan den Rijn, which also served as a testing site. Samsom uses the system to automatically proofread the many professional and business-oriented publications, journals, and books it publishes.
    Gerard Kempen, director of Cognitech, explains that his group specifically set out to develop a spelling checker that handles the linguistic phenomena in Dutch and other languages over which spelling checkers originally written for English frequently stumble. CORRie should not be mistaken for a grammar checker, adds Cognitech's Alice Dijkstra. It is in fact a spelling checker which uses extensive linguistic knowledge to achieve high accuracy rates in Dutch, or in other languages to which it could be adapted.
    Verb agreement
The package is built up of three modules. The first module is a parser specifically designed to automatically detect and correct verb agreement errors. These are common in languages such as Dutch, French, Spanish, and Italian (but not English) when different verb forms sound the same (e.g. and , where the first item sound the same in Dutch). To do this, it performs both morphological (word) and syntactic (sentence) parsing, rather than simply matching a form against a list, as the spellchecker built into your word processor does. A key feature, says Kempen, is the system's robustness: it can correct what he refers to as “regularizations of fossilized multiword idioms.” These are obsolete forms (such as “te allen tijde”) which still occur regularly in modern Dutch.
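As a rough illustration of why a word list alone cannot catch such errors, here is a minimal Python sketch. It is not CORRie's parser: the tiny tables, the function name check_agreement, and the rule that the subject is simply the word before the verb are all assumptions made for this example, and the Dutch pairs word/wordt and vind/vindt (which sound alike) were chosen for the sketch rather than taken from the article.

```python
# Hypothetical subject table: pronoun -> grammatical person.
SUBJECTS = {"ik": "1sg", "jij": "2sg", "hij": "3sg"}

# Hypothetical verb table: form -> (lemma, persons the form agrees with).
# Every form here is a valid Dutch word, so a plain word list accepts all of them.
VERB_FORMS = {
    "word": ("worden", {"1sg"}),
    "wordt": ("worden", {"2sg", "3sg"}),
    "vind": ("vinden", {"1sg"}),
    "vindt": ("vinden", {"2sg", "3sg"}),
}

def check_agreement(sentence):
    """Return (wrong_form, suggested_form) pairs for subject-verb mismatches.

    Simplifying assumption: the subject is the token directly before the verb.
    A real parser would find the subject syntactically (inversion, clauses).
    """
    tokens = sentence.lower().split()
    corrections = []
    for subject, verb in zip(tokens, tokens[1:]):
        if subject not in SUBJECTS or verb not in VERB_FORMS:
            continue
        person = SUBJECTS[subject]
        lemma, persons = VERB_FORMS[verb]
        if person in persons:
            continue  # the form agrees with the subject
        # Look for another form of the same verb that does agree.
        for form, (other_lemma, other_persons) in VERB_FORMS.items():
            if other_lemma == lemma and person in other_persons:
                corrections.append((verb, form))
    return corrections

print(check_agreement("Hij word ziek"))  # [('word', 'wordt')]
print(check_agreement("Ik word ziek"))   # []
```

The point of the sketch is only that the decision depends on the subject, which is information a list lookup never sees.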
    Another module consists of what Cognitech calls a “word recognizer.” This uses a combined statistical and linguistic approach, together with a lexicon containing some basic labelling (part of speech and such features as number), to identify and correct words with orthographical and typographical errors. Kempen claims a high level of accuracy for the module, which provides just a few suggestions, with the correct alternative presented first. The word recognizer also uses phonetic information derived from a grapheme-to-phoneme table. According to Kempen, this can be adapted for other languages but requires careful fine-tuning to achieve optimal results.
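The combination of spelling similarity, phonetic similarity and frequency can be sketched in a few lines of Python. This is an assumed design, not Cognitech's: the lexicon, its frequency figures, the grapheme-to-phoneme table, the threshold and the function names are all invented for illustration, and the part-of-speech labelling mentioned above is left out.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon: word -> relative frequency (invented numbers).
LEXICON = {"trein": 80, "tuin": 60, "trouw": 40, "blij": 70}

# Hypothetical grapheme-to-phoneme table. In Dutch, "ij"/"ei" and "ou"/"au"
# are pronounced alike, so each pair maps to one symbol.
G2P = [("ij", "E"), ("ei", "E"), ("ou", "A"), ("au", "A")]

def phonetic_key(word):
    """Scan left to right, replacing known graphemes by a phoneme symbol."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        for grapheme, phoneme in G2P:
            if word.startswith(grapheme, i):
                out.append(phoneme)
                i += len(grapheme)
                break
        else:  # no grapheme matched at this position
            out.append(word[i])
            i += 1
    return "".join(out)

def suggest(word, limit=3, threshold=0.6):
    """Return up to `limit` lexicon words, best candidate first."""
    if word in LEXICON:
        return []  # known word, nothing to correct
    key = phonetic_key(word)
    scored = []
    for entry, freq in LEXICON.items():
        spelling = SequenceMatcher(None, word, entry).ratio()
        sound = SequenceMatcher(None, key, phonetic_key(entry)).ratio()
        score = max(spelling, sound)
        if score >= threshold:
            scored.append((score, freq, entry))  # frequency breaks ties
    scored.sort(reverse=True)
    return [entry for _, _, entry in scored[:limit]]

print(suggest("trijn")[0])  # trein
```

Here the misspelling "trijn" and the lexicon word "trein" share the same phonetic key, so "trein" is ranked first even though other entries are also close in spelling.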
    The third module is a compound analyzer, a crucial feature for spellchecking in Germanic and Scandinavian languages, where two or more words are routinely joined into compounds that no word list can be expected to contain. It decomposes compounds into their constituent words and infixes using the lexicon of the word recognizer. Compound analysis is also indispensable for correctly hyphenating compound-rich languages. Kempen suggests that both the word recognizer and the compound analyzer could prove useful tools for searching full-text databases and improving OCR output, as well as for word processing.
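Decomposition against a lexicon can be sketched as a short recursive search. The lexicon, the list of linking infixes and the function name decompose below are assumptions for illustration only; the article does not describe how Cognitech's analyzer actually works internally.

```python
# Hypothetical toy lexicon of simple (non-compound) Dutch words.
LEXICON = {"stad", "bestuur", "boek", "kast", "fiets", "pad"}

# Hypothetical linking infixes; "" means the parts are joined directly.
INFIXES = ("", "s", "en", "e")

def decompose(word):
    """Split a compound into lexicon words and infixes, or return None."""
    if word in LEXICON:
        return [word]
    # Try the longest possible first constituent first.
    for i in range(len(word) - 1, 0, -1):
        head = word[:i]
        if head not in LEXICON:
            continue
        rest = word[i:]
        for infix in INFIXES:
            if not rest.startswith(infix):
                continue
            tail = decompose(rest[len(infix):])
            if tail is not None:
                return [head] + ([infix] if infix else []) + tail
    return None  # not analyzable with this lexicon: likely a spelling error

print(decompose("stadsbestuur"))  # ['stad', 's', 'bestuur']
print(decompose("boekenkast"))    # ['boek', 'en', 'kast']
print(decompose("fietspat"))      # None
```

A spelling checker built this way accepts "stadsbestuur" without listing it, while "fietspat" is still flagged because no split yields known parts.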
    Cognitech has also developed a hyphenator generator which produces a small but complete set of hyphenation rules for any language on the basis of a hyphenated lexicon. A unique feature is its ability to combine the conventions of two languages. It can generate rules for German, for example, which will also correctly hyphenate English words found in a German text. This could be a useful utility for publishers proofreading text containing titles, technical terms and names from other languages.
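The idea of deriving rules from a hyphenated lexicon, and of merging two languages, can be sketched as follows. This is a deliberately naive scheme under stated assumptions: a rule is just a pair of letter contexts in which a break was always observed, the word lists and the names learn_rules and hyphenate are invented, and no attempt is made to minimize the rule set as Cognitech's generator is said to do.

```python
from collections import defaultdict

def learn_rules(hyphenated_words, width=2):
    """Collect (left, right) letter contexts where a break always occurs."""
    counts = defaultdict(lambda: [0, 0])  # context -> [breaks, non-breaks]
    for entry in hyphenated_words:
        letters = entry.replace("-", "")
        breaks, pos = set(), 0
        for ch in entry:
            if ch == "-":
                breaks.add(pos)  # break before letter number `pos`
            else:
                pos += 1
        padded = "." * width + letters + "." * width  # "." marks word edges
        for i in range(1, len(letters)):
            context = (padded[i:i + width], padded[i + width:i + 2 * width])
            counts[context][0 if i in breaks else 1] += 1
    # Keep only contexts that were never seen without a break.
    return {ctx for ctx, (yes, no) in counts.items() if yes and not no}

def hyphenate(word, rules, width=2):
    """Insert "-" wherever the surrounding letters match a learned rule."""
    padded = "." * width + word + "." * width
    out = []
    for i, ch in enumerate(word):
        context = (padded[i:i + width], padded[i + width:i + 2 * width])
        if i > 0 and context in rules:
            out.append("-")
        out.append(ch)
    return "".join(out)

# Hypothetical miniature lexicons; real ones would hold many thousands of words.
GERMAN = ["wa-gen", "sa-gen", "le-gen"]
ENGLISH = ["win-dow", "ta-ble"]

rules = learn_rules(GERMAN + ENGLISH)  # combine both conventions
print(hyphenate("wagen", rules))       # wa-gen
print(hyphenate("window", rules))      # win-dow
```

With rules learned from the German list alone, "window" would come back unhyphenated; training on the merged lexicon is what lets the same rule set handle the English word.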
    A non-profit stichting (foundation), Cognitech was established by Kempen in 1987 out of frustration at being unable to find co-development and marketing partnerships in industry for the software and skills he had developed with Dijkstra and colleagues at Nijmegen. “Companies tend to think that if something comes out of the academic sphere, it isn't for real, or is somehow unfinished. Often they don't want to pay anything for it, thinking that if it comes from a state university it should be free,” says Kempen. Cognitech is a means for him and his colleagues to flex their entrepreneurial muscles and exploit the fruits of their labors outside the university.
    Cognitech has supplied its linguistic services to Dutch publisher Van Dale Lexicografie by expanding the lemmas of the latter's Dictionary of Present Day Dutch, recently published electronically as Lexitron on CD-ROM. It also supplied a routine for Lexitron's lookup software that allows users to search for words phonetically. Another project was the automatic generation of pronunciations for the recent Robert & Van Dale Dutch-French dictionary.

Stichting Cognitieve Technologie, Lombokstraat 22, Nijmegen, 6524 LT, The Netherlands; Tel +31 80 512621, email: kempen@nici.kun.nl

COPYRIGHT © 1991 BY LANGUAGE INDUSTRY MONITOR
