Oracle’s Linguistic Gold Mine


This article originally appeared in the Mar-Apr 1993 issue of Language Industry Monitor.

Over the past several years, Oracle has quietly built up a substantial linguistic software development group under the direction of Brett Newbold. Their first product is Oracle CoAuthor. There are more to come.

This year’s attendees of Esther Dyson’s pricey PC Forum in February got a peek at some surprising new linguistic technology coming from – of all places – Oracle, the database company. At the PC Forum, Brett Newbold and Kelly Wical of Oracle’s Advanced Text Products group demonstrated PowerRead, an impressive content reduction system based on Wical’s Text Engine, an enormous 40-megabyte conceptually ordered lexicon. PowerEdit, the ambitious grammar checker for PCs that was launched two years ago by Artificial Linguistics, a previously unknown Texas company, was also based in part on the Text Engine. Oracle acquired Artificial Linguistics last summer.

“When I first heard about PowerEdit, I thought it couldn’t be for real,” relates Brett Newbold, director of Advanced Text Products. “I’ve been in this field a long time and new systems don’t just pop out of the woodwork. Well, I was wrong.” According to Newbold, Artificial Linguistics’ Kelly Wical had been working since 1975 on a text understanding system and supported this Herculean effort by programming minicomputer applications. Wical was good enough at what he did that he could stipulate in his contracts that his customers support his other ongoing long-term project. Artificial Linguistics acquired some external financing in the late 80s, but this brought with it the necessity of a concrete product development plan. The company was languishing, torn between marketing commitments and ongoing development, when Oracle intervened last year and freed Wical and his group to concentrate on further refining their technology.
    “I thought PowerEdit was a good product, but this is what we really acquired the company for,” Newbold says, thumping on a four-inch-thick printout of the Text Engine database. He calls it a connotative network. Whatever that implies, having the Text Engine is for Newbold akin to a miner sitting on a gold mine; he sees it as a rich resource for producing a broad range of linguistic software applications. While we may never get a good look at the Text Engine itself, PowerEdit and now PowerRead lift a corner of the veil by suggesting some intriguing possibilities for putting it to use.
    Unlike PowerEdit, PowerRead will not be offered as a shrink-wrapped package in the immediate future. Rather, it is initially intended as a software tool for enhancing other applications. Message routing, automatic indexing and keyword abstraction, information retrieval – these are just some of the uses which come to mind for PowerRead. Because of the inherent size of PowerRead and the underlying Text Engine, you will more likely find it on a network server than on a standalone workstation, but this fits in well with Oracle’s avowed client/server strategy for the nineties.
    To demonstrate PowerRead, the Oracle group developed a simple prototype application running under Windows. With it, you can read a text in and have PowerRead analyze it. By means of a horizontal scrollbar at the bottom of the window, you can adjust the level to which a text is reduced. To watch this happen onscreen is fascinating; PowerRead actually condenses your text down to the bare bones of a few keywords. Esther Dyson wrote a delightful account in her newsletter, Release 1.0, about putting PowerRead through its paces. Watching PowerRead take multiple passes through her sample text, separating the verbal wheat from the chaff, she exclaimed, “we’d love to try it on some press releases!” We know how Dyson feels; for our part, we think the package could work miracles in some of the Usenet newsgroups.

Newbold joined Oracle three years ago, after a seven-year sojourn in Switzerland, where he helped get ALPS’s European office off the ground and founded Lexpertise (developer of MacProof) together with fellow American Mike Anobile. Upon returning to the USA, Newbold says he was burned out on linguistic software and signed up with Oracle originally as an international sales manager – he speaks fluent French and Spanish. Soon, however, he was expounding his ideas about linguistic software and multilingual documentation issues and caught the ear of Oracle’s prescient founder and CEO, Larry Ellison. “We hit it off extremely well,” says Newbold of Ellison. “LJE is a visionary who has a lot of ideas about where things are going in this industry.” (Just don’t get him started on transaction benchmarking, though.) For his part, Newbold’s long stints at ALPS and Lexpertise provided him with both the technical savvy and proven project management skills required to oversee the development of products. As his former business colleague Mike Anobile points out, “don’t forget that in its heyday ALPS had some of the best computational linguists in the business. Brett learned a lot from them. He has a facility for getting right to the key issues involved.” Newbold himself singles out computational linguists Ken Beesly, Beat Buchman, and Daryl Londsdale as some of the brightest lights from his ALPS days.
    From the outset, Newbold did not want Oracle’s efforts in this field to be a perpetual loss-maker for the company. Newbold has a keen appreciation of the possibilities and limitations of current technology; his group of fifteen or so linguists and computational linguists at Oracle was assembled to develop products within the window of possibility offered by today’s technology. “We’re not trying to achieve technological breakthroughs,” says Newbold.

Oracle – isn’t that just a database company? No, in fact this highly successful software company is gradually moving into the lucrative arena of general office applications, otherwise known as “Bill’s turf.” The Office Automation Group is working on such applications as electronic mail (Oracle recently landed a million-user licence with a major telecom supplier). Meanwhile, the Oracle database is at the heart of countless different kinds of applications. The company is not doing it solo; the Oracle RDBMS, for example, enjoys a particularly warm symbiosis with the document production system Interleaf. The two open-ended systems complement each other well. Relational databases are not known for their text formatting and layout capabilities; conversely, document production systems do not offer extensive file management. Within Interleaf’s Relational Document Management (RDM) system, the Oracle database provides the infrastructure for maintaining complex collections of data, such as maintenance data for aircraft, which Interleaf can use to produce compound documents. The key here is that the data is not locked into a given document format but can be reused in various ways. Diverse types of documents can be produced using subsets of the data, for a wide variety of distribution media. A good example is the electronic documents that can be produced using Interleaf’s WorldView. With RDM, you can also produce and maintain multilingual editions of documents which share the same illustrations. The Oracle database provides control over the diverse data elements, multiple users, and versions; Interleaf pulls it all together and packages it. You can more or less sum it up with one phrase: database publishing.
    Oracle is not new to the international market, claiming that it does business in ninety-eight countries. Oracle 7, the long-awaited new version of the Oracle RDBMS, offers seamless multilingual support through its extensive handling of alternate character sets, and it is designed for localization from the ground up. Newbold, whose long stint in Europe obviously left him keenly aware of the enormous multilingual documentation burden faced by companies operating internationally, has clearly found a receptive and understanding audience within the Oracle organization, which has its own huge documentation headaches. “We’d like to provide some basic tools to assist this process,” explains Newbold. “Things for managing terminology, checking syntax – basically creating translation-aware documents. Oracle would like to be able to offer womb-to-tomb multilingual documentation support.”

Oracle nonetheless remains close to its core business: SQL databases. In Ellison’s vision, the future lies in huge, distributed databases running on multiprocessor machines built out of basically the same commodity hardware parts as desktop workstations, just more of them. Oracle 7, launched earlier this year, goes a long way to making transparently distributed databases a reality. Need more processing or storage capabilities? Just pop another Intel-based server in your network. Ellison believes these databases of the future – what he calls “public databases” – will contain one third conventional “tabular data,” that is database or spreadsheet data, one third structured objects, including things like CAD/CAM drawings and object-oriented data, and one third unstructured text. For the last category, it would obviously be useful to have some content-oriented, linguistically sophisticated technology on hand to make that data more manageable. That is where Oracle’s Advanced Text Products group comes in.
    PowerRead, however, is not the first effort of the combined Oracle team to see the light of day; that honor goes to CoAuthor, Oracle’s text-proofing package for Interleaf. The product enjoyed a quiet launch late last year; since then, Oracle has not been promoting it noisily but rather discreetly circulating it among interested parties. Interleaf – not Oracle – is the distributor of CoAuthor. In keeping with Newbold’s pragmatism, CoAuthor does not promise the sun and deliver the moon, but rather serves as a modest first step into the public arena for this young applications group. Newbold is nonetheless very proud of it. “The program’s interface and its dictionary management tools set CoAuthor apart,” he declares. “With CoAuthor, we started from scratch. No baggage.” Newbold brought knowledge and experience from ALPS and Lexpertise, but not the burdens of an existing product line.
    “The problem with packages like Grammatik and RightWriter is that the interface is so bad,” Newbold says. “You can tell the developers don’t use it themselves every day.” While not pretending to be a grammar checker – the package performs only “local parsing” – CoAuthor goes further than ordinary spellcheckers, and can best be described as a proofreading tool, because it flags such things as punctuation errors, incorrectly hyphenated or compounded words, and verbose phrases. Rather than dropping you directly into your text and beginning a word-by-word correction process, CoAuthor takes a preliminary pass over the text and supplies you with a checklist of errors, allowing you to ignore or correct errors collectively in advance. A morphology component ensures that replacement words are correctly inflected. A nice touch is Auto-Correct, a list of your commonly misspelled words which CoAuthor can correct for you automatically. These are all attractive features that make the standard spellcheckers built into packages like WordPerfect seem primitive by comparison.
    One of the most intriguing aspects of CoAuthor is its support for Simplified English, the language of airplane maintenance manuals. Not performing a full syntactic parse, CoAuthor obviously cannot come close to guaranteeing that a text is certifiable Simplified English, but it is the only such product on the market. Says Newbold, “there’s not a large market for Simplified English checking. But companies that need it are desperate for it.” Presumably CoAuthor is extendable and can be configured for different kinds of restricted language. Perhaps one day we will see the package as a pre-processor for a machine translation system, for example.

CoAuthor is available and is now being used by Interleaf users. Who knows, however, when we will ever see PowerRead in a real-world application? If CoAuthor is anything to go by, Newbold is not likely to oversell this intriguing tool as he investigates various channels for marketing it, including the OEM one. Ellison, meanwhile, is thinking ahead. Earlier this year, he was discussing the future of the database business with a group of corporate investors and touched on the idea of a “custom newspaper.” “I don’t have time for a regular newspaper,” he declares, figuratively addressing his newspaperman. “I just want a collection of stories fed to me each morning about the topics I’m interested in. And it better not cost much more than your regular newspaper – let’s say a dollar. And, by the way, you better keep up with my interests, too. They change.” Idle speculation?
