The Lexicographer’s Dream Machine


This article originally appeared in the May-June 1993 issue of Language Industry Monitor.

DEC and the Oxford University Press recently concluded a noteworthy pilot study which explored a new style of lexicography. Is Hector the model for a new generation of corpus processing tools?

What do lexicographers want? That was the question Mary-Claire van Leunen and her colleagues at Digital’s Systems Research Center (SRC) in Palo Alto asked themselves a number of years ago. To answer the question, they set out to design the ideal lexicographer’s workstation, which they called Hector. As the project evolved, DEC saw in Hector the means to achieve two important goals. The first was to build an application that would exercise SRC’s research systems; the second was to obtain a hand-disambiguated corpus for use in future research. As van Leunen recalls, the idea for Hector arose in a rather casual fashion. For the convenience of colleagues and herself, van Leunen had an online version of the Random House dictionary installed on the SRC network. At one point, a discussion arose about an entry, now long forgotten, which didn’t seem plausible, and she exclaimed, “I’d like to see the evidence for this!” One thing led to another, and over the course of several years Hector was born. This two-year project, which officially drew to a close in March, was an intriguing experiment in high-tech, corpus-based lexicography. Van Leunen wrote a delightful account of Hector which she presented at last year’s New OED Conference in Waterloo, Ontario. Her account will also be published as a Digital research report. Describing the impulse behind the project, she wrote:

The experience of the CoBUILD project at Collins and the University of Birmingham demonstrated that lexicographers need a large corpus, a very large body of running text, to work on ordinary words. But at CoBUILD, the lexicographers looked at their corpus on paper and impressionistically; they spread paper concordances out on their kitchen tables, marked a few lines with colored pens, and threw them away. We wanted to offer lexicographers the opportunity to use a corpus in more rigorous and creative ways.

Says Loretta Guarino, “an application like Hector gave us an opportunity to test the effectiveness of some of our other research projects. We used Modula-3, a programming language developed at SRC and user interface tools developed here. Hector provided us with valuable feedback to other projects, hence its value to DEC.” As systems engineer Jim Meehan elucidates, “corpus-based lexicography is a demanding application, requiring great speed and flexibility in handling huge amounts of data. Language is by definition inexact data, and this presents a tremendous challenge in designing an effective user interface for such a system.”
    Oxford provided the corpus and lexicographers, and the lexicographical expertise that made the tools possible. SRC provided the high tech. At the beginning of the project (October 1990), four members of SRC’s Hector team went to Oxford for a three-week training course for lexicographers. Once the software was operational, a team of lexicographers from the Oxford University Press arrived in Palo Alto and started working with Hector. In the course of two years, the lexicographers compiled highly detailed database entries for a small set of words that occurred between 100 and 1000 times in the 18 million word corpus, and they linked each individual occurrence in the corpus to the appropriate section in the database.
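
The selection criterion just described (words occurring between 100 and 1000 times in the corpus) can be illustrated with a short sketch. This is not Hector's own code, which was built with Modula-3; the function name, the simple tokenizer, and the choice of inclusive bounds are all assumptions made for illustration.

```python
import re
from collections import Counter


def words_in_frequency_band(text, low=100, high=1000):
    """Return {word: count} for words whose count lies in [low, high].

    Hypothetical helper: the tokenizer below is a crude stand-in for the
    cleaned-up, tagged corpus the article describes.
    """
    # Lowercase the text and pull out alphabetic tokens (with optional
    # internal apostrophe, as in "didn't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    # Keep only the words inside the requested band (bounds inclusive,
    # which is an assumption; the article does not say).
    return {word: n for word, n in counts.items() if low <= n <= high}
```

On a real corpus the defaults would apply; on a toy text the band can be narrowed, for example `words_in_frequency_band(sample, low=2, high=4)`.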

What does this state-of-the-art lexicographer’s workstation look like? After various trials, SRC settled on a configuration of three large, high resolution monitors for each workstation. The corpus query tool and Keyword in Context (KWIC) viewer are on the center screen, the definition template editor is on the right screen, and various utilities and online reference works are on the left screen. With its elegant X Window/Motif graphical interface and jazzy color scheme, Hector has a high Wow! factor. Instead of grouping meanings on paper printouts as in the bad old days, the Oxford lexicographers could group corpus evidence on-screen with sense tags while developing their definitions. These sense tags were later stored in the corpus, meaning the lexicographers were both writing definitions and sense-tagging a corpus. The work they did on word senses was not thrown away; a corpus tagged with word senses is a very useful thing to have.
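
For readers unfamiliar with the idea, a Keyword in Context view with sense tags can be sketched in a few lines. This is only a minimal illustration, not a reconstruction of Hector; the `Occurrence` record, the function names, the example sentence, and the sense labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Occurrence:
    position: int                 # token index of the keyword in the corpus
    left: str                     # context to the left of the keyword
    keyword: str                  # the keyword as it appears in the text
    right: str                    # context to the right of the keyword
    sense: Optional[str] = None   # sense tag, filled in by the lexicographer


def kwic(tokens, keyword, width=4):
    """Return one Occurrence per match of keyword, with `width` tokens of context."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(Occurrence(i, left, tok, right))
    return hits


def group_by_sense(hits):
    """Group occurrences by sense tag; untagged hits go under 'untagged'."""
    groups = {}
    for hit in hits:
        groups.setdefault(hit.sense or "untagged", []).append(hit)
    return groups


if __name__ == "__main__":
    # Invented example sentence and sense labels, for illustration only.
    tokens = "The party hopes to capture two seats while police capture the thief".split()
    hits = kwic(tokens, "capture")
    hits[0].sense = "win"
    hits[1].sense = "seize"
    for sense, members in group_by_sense(hits).items():
        for m in members:
            print(f"{sense:>8} | {m.left:>25} [{m.keyword}] {m.right}")
```

Because each tag is stored on the occurrence itself, the same records serve both as evidence for a definition and as a sense-tagged corpus, which is the dual payoff described above.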
    Because the corpus was cleaned up and tagged for parts of speech, the lexicographers had a high level of control over the queries they could make across the corpus. And, as you would expect, a corpus reveals interesting things. Rosamund Moon was one of the Oxford lexicographers stationed at SRC and she points out that the natural inclination of the lexicographer is to look for unusual meanings of words. As a result, traditional dictionaries sometimes have startling lacunas in common senses for ordinary words. She noted, for example, that the metaphorical meaning of the word “capture,” as in “capture two seats in Parliament,” had been overlooked in dictionaries compiled without the aid of a corpus. More recent corpus-based dictionaries, such as CoBUILD Student’s and the American Heritage III, do include this meaning.

Sue Atkins was another of the Oxford lexicographers who spent time at SRC within the framework of the Hector project. As co-editor of the well-known Collins Robert French-English dictionary, Atkins is obviously familiar with the traditional tools of her trade, where the lexicographer’s brain is the chief repository of linguistic data. Not content to simply talk about Hector, she offered her visitor a test drive of this inviting-looking system. We looked up “laconic” with the hunch that it has acquired a second, as yet uncodified meaning, that of “relaxed” rather than the original meaning of “terse.” Several corpus examples verified this sense.
    Atkins is clearly delighted with the possibilities suggested by Hector, but equally clearly the corpus-based approach represents only half of the equation to her. The use of corpora in lexicography also requires a corresponding theoretical basis for interpreting their evidence; Atkins suggests this issue has been overlooked in the past. It was this search for a formal grounding for lexicography, says Atkins, which motivated her to study theoretical linguistics under John Lyons at Edinburgh, twenty years after her modern languages degree there. More recently, she has found a kindred spirit in linguist Charles Fillmore at Berkeley, in particular with regard to his theory of frame semantics, and the two have been collaborating on research projects. Fillmore and Atkins appear to have highly complementary interests: Fillmore the linguist delights in finding evidence for his linguistic theories in corpora; Atkins the lexicographer welcomes such theories to help her interpret corpus evidence, thereby making it possible for her to compile more accurate dictionaries.
    Atkins is also participating in the LRE-funded DELIS project launched this spring, in which a consortium of European linguists and lexicographers are studying corpus usages, in five languages, of two types of words (those relating to perception and speech acts) and mapping them to descriptions in Fillmore’s frame semantics. The many subtle variations in the way in which similar meanings are expressed in the words and the grammar of different languages are not well understood, and are certainly too complex for conventional dictionaries, whether designed for humans or computers. Fillmore’s frame semantics approach, which analyzes meaning and records the way in which delicate shades of meaning get expressed, offers a way of bringing lexicography and linguistics closer together.

The Hector project is now over, and the lexicographers have packed up and returned to England. The software that allowed them to make such fascinating explorations into the language is not portable (it was never meant to be) from Digital’s Systems Research Center to the workstations of Oxford. Concludes Sue Atkins, “the task now is to assess the experience gained from the project and to make the best use of what has been learned about the software functions that are necessary if corpora are to be harnessed in the attempt to build better dictionaries faster.” Corpus-based, large-scale, computerized lexicography is a complex and costly enterprise, and much detailed planning must be done before the experience gained in Hector bears fruit in the commercial world of dictionary publishing.
    While DEC is not getting into the lexicography business, parts of Hector, such as the inflection algorithms, fast full-text indexing algorithms, and data compression schemes, could well be recycled in the office automation applications which form an important part of DEC’s core business. The tagged Oxford corpus could be used for other purposes. Loretta Guarino: “We envisage using a fully-disambiguated corpus to test computer programs which analyze language. This could prove to be quite valuable for information retrieval or automatic document sorting systems.” The corpus is still accessible through the SRC network; apparently a 20 million word corpus is handy to have around. SRC researchers turn to it casually to check their intuitions about words and phrases. “Publishers take note,” wrote van Leunen in a lighthearted conclusion to her Hector account, “A matching dictionary and corpus set, bound in the electronic equivalent of morocco, might be the Christmas gift for 1996.”

COPYRIGHT © 1993 BY LANGUAGE INDUSTRY MONITOR
