This article originally appeared in the Jan/Feb 1993 issue of Language Industry Monitor.

Jan Scholtes combines a keen interest in language technology with a sure-footed entrepreneurial spirit. His specialty? Using statistics to improve the performance of commercial software.

“What can we do with statistics? All kinds of things,” says Jan Scholtes of MSC Information Retrieval Technologies. “We can improve the accuracy of OCR and information retrieval software with some basic bigram and trigram statistics for a language.”

Bigrams and trigrams, known more generally as n-grams, are co-occurrences of two or more adjacent entities, be they letters, words, etc. Naturally, this data is language-specific. Scholtes recalls that the first Dutch edition of a well-known optical character recognition (OCR) package came with a Dutch user manual and Dutch interface but still used English n-gram data. “Useless,” he says, shaking his head.

“If you have trigram statistics for English characters,” continues Scholtes, “you know that the chance of an e following a t and an h is ninety-eight percent. That’s very helpful for an OCR program to know when it’s processing an English text. In Dutch, however, that’s a very rare trigram. Something like aar is frequent, whereas that rarely occurs in English.”

“There was a big scramble a while ago for word bigrams and character bi-, tri-, 4-, and 5-grams for Cyrillic,” says Scholtes. “Thanks to glasnost, the Library of Congress received a container full of documents in Russian and was offering a big reward to the company that could scan in the material. We were approached about supplying n-gram data by one of the OCR companies. It seems we were one of the few companies which had collected that kind of linguistic data.”

Active on all fronts

Scholtes established MSC Information Retrieval Technology as a commercial outlet for his interests in computational linguistics - corpus-based linguistics in particular.
The bread-and-butter work of MSC’s staff of five is currently more along the lines of traditional consulting activities: network maintenance, training, and a bit of custom application programming. MSC is also the Dutch distributor for a number of mainstream text retrieval packages for the PC and the Mac, such as Sonar (Virginia Systems) and Zyindex (ZyLab/Information Dimensions). It also imports the OCR package WordScan (Calera); Asymetrix’s ToolBook, a visual programming package for Windows; and Computer Library on CD-ROM.

Scholtes sees these activities as both necessary and logical steps towards developing some unique expertise in the field of information retrieval. NLP-based technologies still need to ripen fully before they can be commercially exploited, and Scholtes realizes that it will be a few years before MSC can earn a living purely on the basis of its NLP know-how, hence these other, more conventional activities.

“OCR is a technology that was commercialized too soon,” he points out. “The current systems have been too frustrating to use because they have no knowledge of the underlying structure of the language. OCR is similar to speech recognition. If you try to do it blindly, just employing standard Hidden Markov pattern matching techniques, for example, you can only go so far.”

He has come to the interesting conclusion that OCR does not have a tremendous future. It is useful now, but in time, as more and more information is circulated in electronic form, it will become less important.

In any event, information retrieval (IR) is a technology whose time has clearly come. Or, to put it more concretely, IR products are arriving on people’s digital desktops. For the popular Windows (and now DOS) package Zyindex, MSC produced a fully localized version for Holland and Flanders, including a Dutch stop-word list and a complete thesaurus licensed from a Belgian publisher.
MSC also exploited its statistics for character frequency, which enabled it to improve the retrieval engine’s low bit-rate encoding, or simply data compression. “The popular Huffman algorithm is based on English letter frequencies,” explains Scholtes. “When we localized Zyindex, we achieved a twenty-percent savings in the size of indexes by tuning its Huffman-based compression routines for Dutch. The letter k, for example, is relatively infrequent in English and can therefore be assigned to a longer bit code. In Dutch, however, that letter is much more frequent. So we assign it a shorter code.” For Dutch, which has many similarities with English, the savings may be modest; for a relatively more exotic language like Turkish, compression ratios can be vastly improved.

The text collector

Scholtes is an avid collector of texts; he has acquired many hundreds of megabytes of text over the years for various languages. “The quality of a corpus is very important,” he says. “It should have as few typing errors as possible and be spread across as many domains as possible. I pick up texts from all over. From the networks, from CD-ROMs, from other sources.”

Scholtes not only collects texts but also creates the tools to obtain the required statistical data. As a result, MSC can not only process new texts as they are obtained but also produce custom applications for specific purposes. “If you give us, say, twenty megabytes of texts your company has produced, we can generate a thesaurus of relevant terms which you could use for information retrieval purposes. The analysis is never complete, of course, but we can get about fifty or sixty percent. The rest we would then do by hand.”

Scholtes sees a rosy future in the market for the development of custom applications built from standard components, such as Zyindex, Visual Basic, and ToolBook, and enhanced with MSC’s linguistic software skills. Currently, MSC has statistics for twenty languages at its disposal.
These provide the raw statistics for such tasks as building stop-word lists and enhancing compression routines. Having statistics for the character co-occurrences of various languages also makes it possible to determine the language a text is in. Scholtes has written a routine that can do this from just three words of text.

Thanks to the success of the Dutch edition of Zyindex, MSC has just been signed up by ZyLab to prepare a German version of the program and now has a German native speaker preparing the necessary materials. The German Zyindex should be ready this coming April.

Scholtes says Germany is a much more important software market than the Netherlands, but he adds that the Netherlands is nonetheless a useful beachhead for companies entering Europe. “Holland is ideal for testing the waters,” he says. “It’s very heterogeneous, yet small enough not to require a huge investment. Moreover, the Dutch are comfortable working with English. A program’s interface and documentation do not have to be translated immediately. For the Germans, everything must be thoroughly localized, and it better be sehr gründlich!”

Scholtes adds that in some ways it is easier to localize products for Germany. “We inconsistently mix Dutch and English terminology for things like keyboard and monitor. The Germans, however, have a well-defined German word for everything. That simplifies things.” A French version of Zyindex could also be possible, but he believes France is a smaller, more difficult market than Germany.

Commerce and academia

Scholtes combines his technical savvy with a good nose for business and an appreciation of the need for good marketing. “I’ve learned a lot from American companies,” he says. “I’ve spent a lot of time in the US, visiting companies like ZyLab, going to trade shows, attending conferences.” Good support is a marketing tool, and because his company knows its products inside and out, it can support them well.
And a natural extension of good technical support is the kind of custom application, built up with standard packages, that he has alluded to already.

Scholtes has put this software to work within his own organization as well. He has large collections of technical literature and Unix Bitnet archives on-line, scanned in where necessary, and indexed with Zyindex for instant retrieval.

Scholtes has close ties with the University of Amsterdam, where he teaches a day a week and where he is receiving a PhD this month based on his dissertation, Neural Networks for Natural Language Processing and Information Retrieval. In particular, he has worked closely with Remko Scha, a professor in the school’s alfa informatica department, roughly translatable as “computers and humanities.” Like Scholtes, Scha is no stranger to commercial environments: he came to the University of Amsterdam four years ago from Bolt, Beranek & Newman’s Speech Group and, previous to that, from Philips, where he was a colleague of Jan Landsbergen, who later went on to launch Philips’ illustrious Rosetta MT project.

Scholtes clearly feeds on the rich nourishment of ideas which circulate in the academic environment, referring regularly in conversation to the work of Scha, in particular to his data-oriented parsing project, which uses a large annotated, or parsed, corpus as the basis for text analysis. The university in turn also gets something back: it benefits from having MSC as a place where students can spend a semester and acquire some hands-on experience in a commercial setting while writing their senior theses.

“The past forty years have shown us you can only get so far with a logical modeling of language,” says Jan Scholtes. “We seem to be reaching the limits of generative grammar - if only because the sheer number of rules required is overwhelming us.
Statistics can help us around some of these bottlenecks.” He acknowledges that incorporating probabilities into parsers brings with it its own complexities; it is very difficult to adjust the weightings without upsetting the whole apple cart. The appeal to Scholtes of the data-oriented parser of Scha is that it takes a different approach: it derives its frequency statistics from the annotated corpus. This means, in theory, that the bigger the corpus, the more accurate the parser.

“Statistics are very important in the retrieval of large amounts of freely structured information,” sums up Scholtes. “It’s taking a structured approach to unstructured data. But if your statistics aren’t accurate, you’re better off not using them. Particularly with parsing, if statistics fail, they fail completely.”

Thinking ahead

Looking towards the future, Scholtes believes specific technologies come into their own at the right moment. “Ten years ago, no one needed full-text retrieval on their desktops because people didn’t have that much text. Now, everyone has got a big hard disk with thousands of files.”

He sees online filtering as a technology with enormous potential. “The Salomon Brothers, an investment banking house in New York City, has a system which automatically scans incoming stock market information and routes it to the relevant analyst. Now, your average office doesn’t have that kind of online information coming in, so there’s no demand for a commercial package that can do that. But over five, ten years, when everyone is hooked up to ISDN?”

No doubt Jan Scholtes will be ready when that day arrives.

MSC Information Retrieval Technology, Dufaystraat 1, Amsterdam 1075 GR, The Netherlands; Tel +31 20 679 4273, Fax +31 20 671 0793

COPYRIGHT © 1993 BY LANGUAGE INDUSTRY MONITOR