This article originally appeared in the May-June 1993 issue of Language Industry Monitor.

DEC and the Oxford University Press recently concluded a noteworthy pilot study which explored a new style of lexicography. Is Hector the model for a new generation of corpus processing tools?

What do lexicographers want? That was the question Mary-Claire van Leunen and her colleagues at Digital’s Systems Research Center (SRC) in Palo Alto asked themselves a number of years ago. To answer the question, they set out to design the ideal lexicographer’s workstation, which they called Hector. As the project evolved, DEC saw in Hector the means to achieve two important goals. The first was to build an application that would exercise SRC’s research systems; the second was to obtain a hand-disambiguated corpus for use in future research.

As van Leunen recalls, the idea for Hector arose in a rather casual fashion. For the convenience of colleagues and herself, van Leunen had an online version of the Random House dictionary installed on the SRC network. At one point, a discussion arose about an entry, now long forgotten, which didn’t seem plausible, and she exclaimed, “I’d like to see the evidence for this!” One thing led to another, and over the course of several years Hector was born. This two-year project, which officially drew to a close in March, was an intriguing experiment in high-tech, corpus-based lexicography.

Van Leunen wrote a delightful account of Hector, which she presented at last year’s New OED Conference in Waterloo, Ontario. Her account will also be published as a Digital research report. Describing the impulse behind the project, she wrote:

The experience of the CoBUILD project at Collins and the University of Birmingham demonstrated that lexicographers need a large corpus, a very large body of running text, to work on ordinary words.
But at CoBUILD, the lexicographers looked at their corpus on paper and impressionistically; they spread paper concordances out on their kitchen tables, marked a few lines with colored pens, and threw them away. We wanted to offer lexicographers the opportunity to use a corpus in more rigorous and creative ways.

Says Guarino, “an application like Hector gave us an opportunity to test the effectiveness of some of our other research projects. We used Modula-3, a programming language developed at SRC and user interface tools developed here. Hector provided us with valuable feedback to other projects, hence its value to DEC.” As systems engineer Jim Meehan elucidates, “corpus-based lexicography is a demanding application, requiring great speed and flexibility in handling huge amounts of data. Language is by definition inexact data, and this presents a tremendous challenge in designing an effective user interface for such a system.”

Oxford provided the corpus and lexicographers, and the lexicographical expertise that made the tools possible. SRC provided the high tech. At the beginning of the project (October 1990), four members of SRC’s Hector team went to Oxford for a three-week training course for lexicographers. Once the software was operational, a team of lexicographers from the Oxford University Press arrived in Palo Alto and started working with Hector. In the course of two years, the lexicographers compiled highly detailed database entries for a small set of words that occurred between 100 and 1000 times in the 18-million-word corpus, and they linked each individual occurrence in the corpus to the appropriate section in the database.

What does this state-of-the-art lexicographer’s workstation look like? After various trials, SRC settled on a configuration of three large, high-resolution monitors for each workstation.
The corpus query tool and Keyword in Context (KWIC) viewer are on the center screen, the definition template editor is on the right screen, and various utilities and online reference works are on the left screen. With its elegant X Window/Motif graphical interface and jazzy color scheme, Hector has a high Wow! factor. Instead of grouping meanings on paper printouts as in the bad old days, the Oxford lexicographers could group corpus evidence on-screen with sense tags while developing their definitions. These sense tags were later stored in the corpus, meaning the lexicographers were both writing definitions and sense-tagging a corpus. The work they did on word senses was not thrown away; a corpus tagged with word senses is a very useful thing to have.

Sue Atkins was another of the Oxford lexicographers who spent time at SRC within the framework of the Hector project. As co-editor of the well-known Collins Robert French-English dictionary, Atkins is obviously familiar with the traditional tools of her trade, where the lexicographer’s brain is the chief repository of linguistic data. Not content to simply talk about Hector, she invited her visitor to take this inviting-looking system for a test drive. We looked up “laconic” with the hunch that it has acquired a second, as yet uncodified meaning, that of “relaxed” rather than the original meaning of “terse.” Several corpus examples verified this sense.

The Hector project is now over, and the lexicographers have packed up and returned to England. The software that allowed them to make such fascinating explorations into the language is not portable (it was never meant to be) from the Digital Systems Lab to the workstations of Oxford.
Concludes Sue Atkins, “the task now is to assess the experience gained from the project and to make the best use of what has been learned about the software functions that are necessary if corpora are to be harnessed in the attempt to build better dictionaries faster.” Corpus-based, large-scale, computerized lexicography is a complex and costly enterprise, and much detailed planning must be done before the experience gained in Hector bears fruit in the commercial world of dictionary publishing.
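
For readers unfamiliar with the Keyword in Context view and the sense tagging described above, the idea can be illustrated with a minimal sketch in Python. This is not Hector's code (which was written in Modula-3 and is not reproduced here); the function name `kwic`, the sample sentences, and the tag assignments are all invented for illustration.

```python
import re
from collections import defaultdict


def kwic(text, keyword, width=30):
    """Return one (offset, line) pair per occurrence of keyword in text.

    Each line shows the keyword in brackets with up to `width`
    characters of context on either side.
    """
    rows = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((m.start(), f"{left:>{width}} [{m.group(0)}] {right}"))
    return rows


# Invented sample text, not taken from the Hector corpus.
corpus = (
    "His reply was laconic: a single word. "
    "She gave a laconic shrug and strolled off, unhurried."
)

rows = kwic(corpus, "laconic")
for offset, line in rows:
    print(offset, line)

# Hypothetical sense tags, assigned by hand as a lexicographer might,
# keyed by the character offset of each occurrence in the corpus.
sense_tags = {rows[0][0]: "terse", rows[1][0]: "relaxed"}

# Group the occurrences by sense, linking each one back to the corpus.
by_sense = defaultdict(list)
for offset, sense in sense_tags.items():
    by_sense[sense].append(offset)
```

Keying each tag by the position of the occurrence is one simple way to keep the link between a dictionary sense and the corpus evidence for it, so that the grouping work survives as a sense-tagged corpus rather than being thrown away.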