Behind the new O.E.D. on CD-ROM


This article originally appeared in the Mar/Apr 1992 issue of Language Industry Monitor

While reports of the death of print have been greatly exaggerated, an impressive new CD-ROM version of the venerable Oxford English Dictionary could be the writing on the wall for many an overweight reference work. AND Software is at the center of this and many other exciting developments in the world of electronic publishing.
    “We came to electronic publishing through the Bible,” relates Hans Abbink, co-founder and director of AND Software, Rotterdam. No, it was not a case of divine inspiration, rather one of worldly entrepreneurship: Abbink proposed the idea of an electronic edition of God’s Word to the Dutch Bible Society several years ago. It liked the concept and gave AND its blessing to go ahead with the project.
    The electronic Bible AND produced had a verse-level database structure as well as a full-text index, which allows, for example, all the instances of a word such as zond (“sin”) to be retrieved in less than a second and viewed in context.
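    The combination of a verse-level structure and a full-text index can be illustrated with a minimal sketch. This is not AND's implementation, which the article does not describe; it is a plain inverted index in Python, and the sample verses, the `build_index` and `lookup` names, and the tokenization rule are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Invented sample data (not quotations from any edition):
# (book, chapter, verse) -> verse text.
VERSES = {
    ("Genesis", 4, 7): "sin is crouching at the door",
    ("Psalms", 51, 2): "cleanse me from my sin",
    ("John", 1, 1): "in the beginning was the Word",
}

def build_index(verses):
    """Map every lowercased word to the set of verse references containing it."""
    index = defaultdict(set)
    for ref, text in verses.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(ref)
    return index

def lookup(index, verses, word):
    """Return (reference, text) pairs for a word, so each hit can be shown in context."""
    refs = sorted(index.get(word.lower(), ()))
    return [(ref, verses[ref]) for ref in refs]

if __name__ == "__main__":
    idx = build_index(VERSES)
    for ref, text in lookup(idx, VERSES, "sin"):
        print(ref, "->", text)
```

    Because the index is built once in advance, a lookup touches only the verses that contain the word rather than scanning the whole text, which is what makes sub-second retrieval plausible even on modest hardware.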
    Selections can be printed or saved to disk. The electronic Bible was well received by the members of the Bible Society, who have already commissioned an update which will allow them to enter annotations. So as not to lead anyone into temptation, the package is copy-protected.
    “Data compression and retrieval is just one of our specialties,” Hans Abbink is quick to point out. “We have people working in other areas as well, such as geographical software, operations research, and, more recently, personal identification and NLP. However, we do indeed regularly employ our data compression techniques for those applications as well.” Nonetheless, AND is probably best known for the numerous electronic publishing projects in which it has been involved; it is in any case the primary impulse for Language Industry Monitor to make the trek to Rotterdam.

CompLex
AND has developed an in-house library of routines, called CompLex, for compressing and indexing data and building retrieval applications. “Using this system, we can produce a prototype application in days,” says Abbink. “A big variable is, of course, the quality and consistency of the original data.” CompLex is also the name AND uses for its retrieval software; it runs under DOS, Windows 3, and on the Mac and can access multiple reference works at the same time.
    In the past several years, AND has produced a postcode database for the Dutch PTT (and compressed the original data from 80 MB to 3.3 MB for distribution on diskette), Vertaal!, a two-way Dutch–English pop-up dictionary for Dutch publisher Het Spectrum, and PC-Wörterbuch for Langenscheidt. Other customers have included Robert, Duden, Meyers, and Hutchinson. Many of the AND-produced titles are now bundled with a new line of PCs recently announced by Philips. These are supplied with a CD-ROM drive and a CD-ROM containing a number of reference works. Six versions of the CD-ROM have been produced, in Dutch, English, French, German, Spanish, and Italian, each with a dictionary and encyclopedia in the respective language and AND’s CompLex software.
    Ongoing projects include a new, as yet unannounced series of reference works for HarperCollins and preparation for publication on CD-ROM of the famed Index Kewensis for the Royal Botanic Gardens, Kew. The Index Kewensis, begun more than a century ago with funding from Charles Darwin, is the authoritative index of the world’s plant names. Because it has become untenably cumbersome in size, and therefore uneconomical to publish in book form, its publisher, the Oxford University Press, will henceforth publish it on CD-ROM and thus be able to update it more regularly. “The Index Kewensis will be what we call a partially-structured textbase,” explains Abbink. “This means you can restrict your searches per volume, genus, family, or, for example, to date of first citation.”
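    What a “partially-structured textbase” might look like in practice can be sketched as follows. This is an assumption-laden illustration in Python, not the design of the actual product: the `Entry` fields, the `search` function, the volume numbers, and the sample records are all hypothetical. The idea shown is simply that each entry keeps its free text alongside a few structured fields, and a query combines a text match with restrictions on those fields.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    volume: int        # hypothetical volume number
    family: str
    genus: str
    first_cited: int   # year of first citation
    text: str          # free-text part of the entry

# Illustrative sample records; volume numbers and wording are made up.
ENTRIES = [
    Entry(1, "Rosaceae", "Rosa", 1753, "Rosa canina, the dog rose"),
    Entry(1, "Rosaceae", "Rubus", 1753, "Rubus idaeus, the raspberry"),
    Entry(2, "Fagaceae", "Quercus", 1753, "Quercus robur, the common oak"),
    Entry(2, "Rosaceae", "Rosa", 1784, "Rosa rugosa, the Japanese rose"),
]

def search(entries, term, cited_between=None, **fields):
    """Full-text match on `term`, narrowed by exact field values
    (volume=, family=, genus=) and an optional inclusive year range."""
    needle = term.lower()
    hits = []
    for entry in entries:
        if needle not in entry.text.lower():
            continue
        if any(getattr(entry, name) != value for name, value in fields.items()):
            continue
        if cited_between is not None:
            first, last = cited_between
            if not first <= entry.first_cited <= last:
                continue
        hits.append(entry)
    return hits

if __name__ == "__main__":
    for hit in search(ENTRIES, "rose", genus="Rosa", cited_between=(1760, 1800)):
        print(hit.genus, hit.first_cited, hit.text)
```

    A real system would consult indexes for both the text and the fields instead of scanning every entry, but the query shape, free text plus field restrictions, is the same.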
    Diderot is a noteworthy project currently under way for Dutch publisher Het Spectrum and is partly funded by the Dutch government. Here, AND is developing a sophisticated yet flexible database system running on Sun workstations which will facilitate the publication of reference works by Het Spectrum. Diderot goes far beyond the simple flat-file paradigm of current lexicographical workstations by offering a true relational structure for lexicographic and encyclopedic data. End-users of Diderot can easily manipulate and view their data in a variety of ways, thanks to a graphical input-screen generator. The next step, says Abbink, will be to create an object-oriented semantic network across the database to provide an even more powerful tool for creating new reference works. Diderot is currently being used by a group of lexicographers at Het Spectrum offices in Utrecht for the production of dictionaries.
    Another tool which AND has nearly completed is SUMMA, a content-scanning system designed for the automated production of abstracts and key word indexes. SUMMA uses lexicons and tables of lexical data to analyze sentence structure and recognize discourse, with language-specific modules for annotation and generation of abstracts. Front-end tools allow users to index and establish topics and postedit where required. A prominent British insurer is AND’s first customer for the system.

Oxford
Finally, there is the new CD-ROM version of the Oxford English Dictionary (the first edition is nearly three-and-a-half years old), arguably the most prestigious electronic publishing project in which AND has hitherto been involved. After extensive deliberation with the Oxford University Press, AND has implemented it as a Microsoft Windows application. It is currently in a late beta-testing stage and slated for either a late June or early September release. Demonstrating the package, Abbink and AND software engineers are justly proud of their efforts.
    The new OED is a milestone product, something people will want to acquire a CD-ROM drive just to be able to use, Microsoft Windows notwithstanding. It is fast, and, thanks to the use of different colors for the different sections, the sometimes longish entries can be easily read or scanned. The software is smart, too, owing to the dictionary’s database structure. This means you can perform searches for etymologies, variant forms, citations, phrases, phonetics, Greek words (it displays the Greek as well), and even date ranges. You can also select what part of the entries you want to be displayed. Best of all, it contains all the information of the original twenty-volume edition (550 MB of information compressed to 200 MB; the remaining space is used for performance-boosting indexes).
    “We spent four to five months just on the design stage,” says Abbink. “We spent a lot of time trying to determine how people use such things as etymologies and citations. We kept asking Oxford, ‘What is useful? What do you want to be able to do with it?’ We said to them, ‘We’re not new to this business, but we do need a lot of input.’”
    Parsing the SGML-encoded file which Oxford supplied revealed some interesting things. Abbink: “When you analyze a text with the computer, you look at it in a totally new way. You see things you’ve never been able to see before.” Because so many different people have contributed to the OED over the years, slight inconsistencies have crept into such things as abbreviations and punctuation. Working together with Oxford, AND regularized many of the deviations. “We had to tread carefully,” says Abbink. “This is, after all, the Holy Book of the English language,” he adds with a wink.
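    The kind of inconsistency described here, the same abbreviation written with and without a full stop or with different capitalization, is easy to surface once the text is machine-readable. The following Python sketch is one hypothetical way to do it and is not taken from AND's tooling: the `variant_groups` and `canonical_map` names, the grouping rule (ignore case and a trailing full stop), and the sample tokens are all assumptions. It only proposes the most frequent spelling as the preferred form; the decision to apply a change would still rest with an editor.

```python
from collections import Counter, defaultdict

def variant_groups(tokens):
    """Group tokens that differ only in case or a trailing full stop,
    keeping just the groups that actually contain more than one spelling."""
    groups = defaultdict(Counter)
    for token in tokens:
        key = token.lower().rstrip(".")
        groups[key][token] += 1
    return {key: counts for key, counts in groups.items() if len(counts) > 1}

def canonical_map(groups):
    """Propose the most frequent spelling in each group as the preferred form."""
    mapping = {}
    for counts in groups.values():
        preferred = counts.most_common(1)[0][0]
        for variant in counts:
            if variant != preferred:
                mapping[variant] = preferred
    return mapping

if __name__ == "__main__":
    # Made-up label tokens of the sort a dictionary entry might carry.
    sample = ["adj.", "adj.", "adj", "Obs.", "obs.", "obs.", "obs.", "n."]
    for variant, preferred in canonical_map(variant_groups(sample)).items():
        print(f"{variant!r} -> {preferred!r}")
```

    Listing the proposed replacements for review, rather than rewriting the text automatically, matches the caution Abbink describes.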

Query language
A lot of effort has gone into the package’s query language, called QUEST; the AND engineers are currently putting the final touches on it. “QUEST will be especially interesting for researchers,” says Abbink. “You can formulate an extremely complex query with QUEST and let your system process it overnight.” QUEST is a pattern-matching language designed for natural language, developed by linguists at the Free University in Amsterdam and the University of Utrecht for AND’s remarkable electronic Hebrew Bible, yet another ongoing project. QUEST’s authors, adds Abbink, hope it will become a standard for querying natural language; he realizes, however, that creating de facto standards in the computer industry takes more than good intentions.
    The ball is now in Oxford’s court. Like many other well-established publishing houses, it has had difficulty deciding how to position and market such new products in the past, whether to treat them just as books or to tread into the treacherous waters of software marketing. Meanwhile, the AND team is already considering future enhancements to the OED, such as a DDE interface to other Windows applications and possibly a network version based on a client/server design. CD-ROM technology has its limitations, such as slow access and throughput speeds, but Abbink believes the technology will catch up, and catch on. “Early efforts in the field of electronic publishing were disappointing for users because they used standard text retrieval techniques,” says Abbink. “They were not intelligent, not efficient. They just used brute force. We’ve now moved beyond that. This is now a second generation technology.” Whereas the first electronic reference works generally offered a subset of the information contained in their printed counterparts, the new OED portends products which offer a true superset of that information, not least through added functionality. Where Oxford goes, others will hopefully follow.

AND Software Westersingel 108, 3015 LD Rotterdam, The Netherlands Tel +31 10 436 7100, Fax +31 10 436 7110

COPYRIGHT © 1992 BY LANGUAGE INDUSTRY MONITOR
