This article originally appeared in the Jan/Feb 1991 issue of Language Industry Monitor.

Much of the current software that we can loosely group under the category ‘text critiquing’ (spell-, grammar- and style-checking, as well as readability scaling) has its origins in the work of Henry Kucera, currently Professor of Slavic Studies at Brown University in Providence, RI (USA). This came about first through his seminal research in computational linguistics, and subsequently through his long and fruitful collaboration with the Software Division of Houghton-Mifflin in producing linguistic software.

Henry Kucera was born in Czechoslovakia in 1925. When the Communists came to power in 1948, he was thrown out of university the “first day,” interrupting his PhD studies in philosophy and linguistics. In the 1950s, Kucera found his way to Brown University, where he was able to pursue his interest in linguistics further. At Brown, he became interested in the computational analysis of human language, though at the time there were scarcely any tools for this type of analysis. Assured on one occasion by a Brown MIS director that computing time was available for an early research project, he was nonetheless told he would have to program it himself, using what he describes as a rather primitive meta-assembler.

In 1967, Kucera and Nelson Francis published their classic work ‘Computational Analysis of Present-Day American English,’ known today simply as the ‘Brown Corpus.’ It is perhaps for this work that Kucera is best known. The Brown Corpus was a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources. Kucera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, psychology, statistics, and sociology.

Its interest awakened by the Brown Corpus, and especially by the word-frequency analysis, Boston publisher Houghton-Mifflin approached Kucera shortly thereafter to supply a million-word, three-line citation base for its new American Heritage Dictionary. This ground-breaking dictionary was the first to be compiled electronically on the basis of word frequency. Kucera still speaks fondly of the project, noting that “it was led by a small but expert board of linguists.” Were there advantages to developing the citation base from scratch? “I’d rather have started with Merriam-Webster’s citation base,” he says, “but that was all on paper.”

Kucera wrote his first spellchecker over Christmas 1981, in PL/I for VAX machines, at the behest of DEC. It was in fact a rapid, simple spell verifier, he says. “Early spellcheckers were, strictly speaking, ‘verifiers,’ not ‘correctors.’ They might have been okay for typos, but they weren’t so helpful for logical or phonetic errors.”

“The difficulty is offering useful suggestions for misspelled words,” says Kucera. “It’s very difficult to code ‘similar.’ The point is not to try to make exact matches but to reduce words to ‘skeletons’ to feed to your guessing engine, which has to be both fast and accurate. Basic information theory tells us not all letters have equal value, so it’s irrelevant whether something is spelled ‘ie’ or ‘ei.’ We do on-the-fly conversion to skeletons and then rely on pattern recognition algorithms to follow up associations.”

Kucera notes the two perils of writing spellcheckers: ‘overflagging,’ which results from too small a lexicon, and ‘collisions,’ whereby incorrectly spelled words are skipped because they are mistaken for other, valid words. “Statistical analysis shows us that the optimal size for a spellchecker’s word list is around 90,000 words.
More than that, you start having too many collisions.”

Kucera oversaw the development of Houghton-Mifflin’s sophisticated Correct Text grammar checker, which, he says, also draws heavily on statistics. Its two-pass parser uses statistical means for a considerable amount of disambiguation, which, he explains, reduces backtracking to a minimum. “It’s vital to know, for example, that ‘spring’ is twenty times more likely to be a noun than a verb. But statistics don’t tell you everything, of course, like whether ‘computer design’ is a noun + noun or noun + verb.”

Recently welcomed back to Czechoslovakia like the proverbial prodigal son, Kucera had a chance to see at first hand what his colleagues there and elsewhere in Eastern Europe have been doing. “There’s keen interest in machine translation in Eastern Europe,” he acknowledges, “but it’s mostly theoretical, just descriptions on paper, although I did see an excellent morphology analyzer for Czech, a highly inflected language. But in general, the Eastern Europeans are short on practice. They just don’t have the computing power. And even if they did, they don’t have the personnel, the able and willing graduate students,” he adds with a wink.

Despite his long involvement in and influence on these various meeting points of language and computers, Kucera is a technological moderate at heart. Don’t throw those books away yet, he might say. “Computers shouldn’t try to reproduce the printed page. They’re really best suited for hypertext-type applications, or, for example, retrieving all the Middle Dutch roots from a dictionary.” But Kucera does look forward to increasing linguistic awareness in software, citing as an example the Macintosh file utility ON Location, with its simple but effective word-stem deriving search routines.

What software does Kucera himself use? “Oh, I couldn’t live without a spell checker. I’m a fast typist and make typos, even phonetic errors occasionally.
You know how it is: an author always reads what he anticipates is on the page.”

COPYRIGHT © 1991 BY LANGUAGE INDUSTRY MONITOR
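
Kucera’s remark about reducing words to ‘skeletons’ can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the reduction rules (keep the first letter, drop later vowels, collapse doubled letters), the tiny word list, and the names `skeleton`, `build_index`, and `suggest` are all invented here. The article does not describe the rules Houghton-Mifflin’s software actually used, nor the pattern recognition step that followed.

```python
from collections import defaultdict

# Assumption: treat these letters as low-information once past the first letter.
VOWELS = set("aeiouy")


def skeleton(word: str) -> str:
    """Reduce a word to a rough consonant 'skeleton' (illustrative rules only)."""
    w = word.lower()
    if not w:
        return ""
    # Keep the first letter, then drop vowels, so "ie" and "ei" both vanish.
    kept = [w[0]] + [c for c in w[1:] if c not in VOWELS]
    # Collapse runs of the same letter so doubled consonants do not matter.
    out = [kept[0]]
    for c in kept[1:]:
        if c != out[-1]:
            out.append(c)
    return "".join(out)


def build_index(lexicon):
    """Map each skeleton to the correctly spelled words that share it."""
    index = defaultdict(list)
    for word in lexicon:
        index[skeleton(word)].append(word)
    return index


def suggest(word, index):
    """Return lexicon words whose skeleton matches the (possibly misspelled) word."""
    return index.get(skeleton(word), [])


# Hypothetical four-word lexicon, purely for demonstration.
LEXICON = ["receive", "occurred", "separate", "spring"]
INDEX = build_index(LEXICON)

print(suggest("recieve", INDEX))   # candidates sharing the skeleton of "recieve"
print(suggest("seperate", INDEX))  # candidates sharing the skeleton of "seperate"
```

Because the lookup is a single dictionary access on a precomputed index, a scheme of this kind is fast, which matches the requirement Kucera states for the guessing engine. Accuracy is another matter: rules this crude would return many unrelated candidates against a 90,000-word list, which is presumably where the follow-up pattern recognition he mentions comes in.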