Houghton Mifflin’s Software Strategy Pays Off

This article originally appeared in the May-June 1993 issue of Language Industry Monitor.

Just what is electronic publishing? And where do traditional publishers fit into this rapidly changing landscape? Many of the world’s publishers are asking themselves these pertinent questions today. Their conclusions vary.

For 150-year-old Boston publisher Houghton Mifflin, the move into electronic publishing has been a slow, deliberate process that has taken many years. While it has not yet published many titles in electronic form, Houghton Mifflin has quietly built up unrivalled expertise in linguistic software engineering and extensive linguistic resources, with which it has been enjoying growing commercial success.

Houghton Mifflin’s grammar checker, CorrectText CGS, is integrated in two of the best-selling Windows wordprocessing packages, Word for Windows and Ami Pro. And, at the spring Comdex, WordStar International introduced electronic versions for DOS, Mac, and Windows of Houghton Mifflin’s superb new Third Edition of the American Heritage Dictionary. These new electronic editions are the debut offerings of WordStar’s new MultiMedia Division and look likely to equal, if not surpass, the sales of earlier versions of the American Heritage offered by WordStar. With its in-house resources and close relationship with WordStar and other prominent software houses, the Software Division is superbly poised to help the company move some of its vast copyrighted holdings into the mushrooming world of new media.

The venerable Boston publisher’s first brush with the digital revolution came in the mid-1960s, with the compilation of the American Heritage Dictionary, the first dictionary to be both compiled and typeset with the assistance of a computer.
The dictionary was based on frequency statistics elicited from the Brown Corpus, the million-word corpus of American English compiled by pioneering linguists Henry Kucera and Nelson Francis at Brown University. This dictionary, published in 1969, was a resounding success and went on to become one of Houghton Mifflin’s best-known reference titles. It was also the start of an enduring association with Professor Henry Kucera, a linguist in the Department of Slavic Studies at Brown University.

Upon completion of the American Heritage Dictionary, Houghton Mifflin found itself with a computerized word list on its hands and asked itself the rhetorical question: What else can we do with this? There appeared to be interest in the software world in such materials; during the 1970s, periodic requests came in for wordlists in digital form. “And by the way,” Houghton Mifflin was asked, “do you know anyone that can write the software?” A small group at Houghton Mifflin thus made its first excursions into linguistic software, working together with Kucera, who developed programs for spellchecking and for morphological and syntactic analysis. Kucera brought both his expertise as a professional linguist and his pragmatic nature to bear on the collaboration, resulting in programs which were both efficient and effective on desktop machines.

The Software Division was established in 1982, and its first OEM customer also dates from that year. Initially, the division offered just wordlists for spellcheckers, but in 1985 it expanded its offerings to include an electronic thesaurus, phonetic matching, a morphologizer, and frequency ranking. Hyphenation followed in 1986, the year the Division was formally spun off from the Reference Division as an independent entity. Grammar checking followed in 1987, with foreign language spellchecking in 1988. In 1989, the Division introduced an electronic version of the American Heritage Dictionary.
Houghton Mifflin now offers two broad classes of software: writing tools and reference work technology. Its software runs on a variety of systems, including DOS, Macintosh, Windows, OS/2, Sun, VAX, Unix, and handhelds.

Houghton Mifflin’s traditional publishing activities are centered squarely in the mass market, namely trade (publisher-speak for general titles), reference, and educational materials. The Software Division, in contrast, has eschewed the retail market, preferring to license its wares to third-party developers (OEMs), who in turn incorporate the Houghton Mifflin software within programs or develop standalone products. The one incursion it has made into retail marketing was the introduction in 1989 of an electronic version of the second edition of the American Heritage Dictionary for the PC. The product was a success, says Director of Marketing Services John Riley, but the Software Division decided to license it to WordStar and concentrate on what it does best, OEM development and marketing.

The Software Division currently numbers between seventy-five and one hundred people. As Director of Research Win Carus, who joined Houghton Mifflin in 1982, explains, “We bring together three professions which usually don’t have much to do with each other: computer scientists, linguists, and lexicographers. Traditionally, they think in different ways. Here, they are working together in a single environment.” In addition to many software engineers and linguists, the Division has, for example, compression and retrieval specialists, SGML experts, and a technical writer, who documents the OEM products. Then, of course, there is a sales and marketing staff.

“There are a lot of people who can write software,” points out Carus, “but few that truly understand linguistic matters. Frequently, it comes down to a conflict between theory and practice.
Software must correspond to linguistic reality, otherwise it is not extendable.” Carus adds that English is the worst language to base software on; it is the odd one out. “In terms of inflection, English and Chinese are more similar than English and French. English is highly context-dependent, and requires extensive work to handle noun clusters and particle attribution. Engineering sameness is not the same as linguistic sameness,” cautions Carus.

Carus is well versed in current trends in NLP research, but he is not always impressed with what comes out of research labs. “Many researchers have no conception of what it takes to write real working systems,” he declares. “It takes an enormous amount of work to develop a program which offers truly generic coverage. There are no dramatic breakthroughs to be made here, just a lot of hard work.” Progress is slow; Systran, not Lotus 1-2-3, is the paradigm for the linguistic software world. “Because of the high development costs of linguistic software, we have to plan very carefully where we will deploy our resources,” adds Carus. This implies balancing the wishes of OEM customers with the Division’s own perceived goals; the two do not always overlap.

Houghton Mifflin nonetheless stands to profit handsomely in the spiralling feature wars of the highly competitive mainstream software market, as linguistic enhancements are one of the obvious ways in which software packages, wordprocessors in particular, can be improved. But as software increases in linguistic sophistication, a corresponding increase in understanding of its abilities cannot always be taken for granted. Says Product Marketing Director Anne Komer, “Grammar correction systems can be difficult products to evaluate.
Reviewers tend to abandon a scientific approach and focus on their own grammatical pet peeves, resulting in inaccurate reviews.” To help people understand and judge the capabilities of packages like CorrectText, Komer would like to see an independent organization provide objective benchmarks for evaluating these programs.

In August of last year, Houghton Mifflin published the third edition of the American Heritage Dictionary. It met with critical acclaim and popular success, enjoying a run of fourteen weeks on the New York Times non-fiction bestseller list. Its sleek black cover with gold lettering, its hefty format, and its crisp typography conspire to make this US$40 tome the Lexus of single-volume dictionaries. Back, no doubt by popular demand, is the memorable section on Indo-European roots.

“The third edition is a milestone in the history of our company,” says Director of Technical Development Kirby Mansfield. “It is the first reference work we have produced which is explicitly designed for multiple uses.” Adds John Riley, “Whereas previously the Software Division was not really in sync with the Trade and Reference Division, we were involved with the third edition from the very beginning.” Because this project also took the needs of the Software Division into consideration, the Division was able to have the electronic version ready in just three months for such OEM customers as WordStar International to implement as retail products. For Houghton Mifflin, the electronic version of the American Heritage is no longer a poor cousin to the paper edition.

Like its predecessors, the third edition was compiled using the American Heritage corpus to assist in the analysis and description of, in particular, the core vocabulary (the first five thousand lemmas), which are by far the most lexically complex in the English language.
From its inception, the American Heritage Dictionary has been a dictionary of contemporary usage that balances descriptivism (if a word is used, it should probably be included and described) with prescriptivism (how a word is used, i.e., whether there are any restrictions on its application), with additional support for the latter in the form of a 173-strong Usage Panel.

Houghton Mifflin uses its corpus for more than just compiling dictionaries, of course. Without going into detail about the company’s corpus strategy, Carus says that in broad outline it entails collecting and accurately annotating balanced corpora for all major Western and Eastern European languages and for some Asian languages. He emphasizes that statistical representativeness and annotational accuracy are extremely important to Houghton Mifflin; sheer bulk is not. “The corpus collection work by the US Data Collection Initiative and European Corpus Initiative is highly opportunistic,” says Carus. “They result in grab bags of text that are not representative. We feel that unless we start from a balanced, carefully annotated corpus, we cannot make accurate and safe generalizations about the core of the languages we work on, nor can we make statistically valid recommendations about how much more to add to our corpora when we need to deal with under-represented lexical, grammatical, or syntactic phenomena.”

Houghton Mifflin has developed its own tools for tagging part-of-speech and constituent structures in corpora; heavily annotated corpora are extremely useful for testing parsers and other tools as well as for compiling dictionaries. The company provided the Hector team at DEC (see page 12) with its software for tagging its 20-million-word Oxford corpus, and the DEC group gave the Houghton Mifflin tagger high marks for its accuracy.

Houghton Mifflin’s experience with corpora extends in other ways as well.
Whereas the company’s first hyphenation routines were algorithm-based and, as such, were fast but not accurate, the second-generation system, International Hyphenator, is a pattern-matching system that was developed by analyzing large corpora of hyphenated texts. According to Carus, this results in nearly flawless hyphenation in the fourteen national languages that the package supports.

At the Hannover trade show CeBIT in March, the Software Division was out in force, with a team of nine spending seven long days staffing the Houghton Mifflin booth. More than just a matter of showing the flag, CeBIT is an important opportunity for the company to meet with existing customers and secure new ones. While many of its more than one hundred fifty OEM customers are American software houses localizing products for foreign markets, Houghton Mifflin also counts a number of European companies as customers. A Dutch Atari developer introduced a new version of its wordprocessor at this year’s CeBIT which now includes Houghton Mifflin’s German spellchecker, and Siemens Nixdorf recently licensed the German thesaurus for a new office application.

Keeping pace with the booming market for localized products, Houghton Mifflin has been steadily expanding its international offerings. With the major Western European languages well covered, the company is now making a concerted push into Eastern European languages, and recently introduced spellcheckers for Czech and Russian which it had compiled internally. Naturally, the Division called on Kucera, whose mother tongue is Czech, to help obtain the necessary word stock and to test the Czech spellchecker.

This year’s CeBIT saw the Software Division launch its International Electronic Thesaurus, a series of eleven thesauri derived from materials licensed from a number of major European publishers.
Compiling new linguistic resources is not simply a matter of repackaging wordlists but often entails converting typesetting tapes to machine-readable form and going through them with a fine-toothed comb to look for errors. Not all European publishers have taken the plunge with SGML, but Carus singles out Van Dale, the prominent Dutch publisher, as supplying accurately tagged material of very high quality; its Dutch thesaurus, for example, includes superordinates for many entries, a rare but very welcome addition.

In May of this year, the Software Division moved across the Charles River from Cambridge to Boston. The move had its physical imperatives (projected growth) but also a symbolic dimension. The Software Division is now quartered in the same building as the Trade and Reference Division. Houghton Mifflin the book publisher will in the future be sharing more and more of its titles with Houghton Mifflin the software house, hence the need to be in closer physical proximity. Some titles are already available: the American Heritage Dictionary has been joined by Roget’s II Thesaurus, the Dictionary of Cultural Literacy, Simpson’s Contemporary Quotations, and The Written Word, a grammar and style guide.

Douglas Eisenhart moved from the trade book division of Houghton Mifflin to the Software Division last summer, and he is helping define the company’s electronic publishing strategy for non-lexical materials. He acknowledges that Houghton Mifflin has been moving into this arena with careful circumspection, both in licensing its copyrighted materials to others and in issuing its own electronic titles. According to Eisenhart, Houghton Mifflin gets regular offers from people wanting to develop electronic versions of Peterson’s venerable Field Guides, the bible of bird watchers; Peterson’s is typical of the kind of publication that seems ripe for multimedia exploitation.
The company sees electronic versions of select titles supplementing paper-based editions, and it is currently studying ways of using technology to add value to such titles. The bird watcher of the future, for example, might one day have Peterson’s on a home computer for learning digitally recorded bird calls as well as a paper edition for “the field.”

Numerous issues confront publishers moving into electronic publishing. For example, which platforms should be supported? There are not just the familiar Mac-DOS-Windows formats to contend with but also a bevy of handhelds and optical disk formats now appearing. This is considerably more confusing than the two platforms (hardcover and paperback) that the traditional publishing world has had to deal with until now. There are also interface issues to consider. Obviously, for a mass-market product, a simple command line interface just won’t do.

Publishers inevitably face conflicts of interest in reconciling the new and the old, and they will have to be flexible in adapting to this rapidly evolving landscape. For those publishers, like Houghton Mifflin, which are willing to face issues head-on, the future rewards will vastly outweigh the current risks.

Houghton Mifflin, Software Division, 222 Berkeley Street, Boston, MA 02116-3764, USA; Tel +1 617 351 3000, Fax +1 617 351 1115

COPYRIGHT © 1993 BY LANGUAGE INDUSTRY MONITOR