This article originally appeared in the Jan/Feb 1993 issue of Language Industry Monitor.

Like most computer companies, Sun Microsystems is aware that the future of the computing industry lies in software. At Sun's East Coast labs, a group led by Bill Woods is investigating ways of improving information retrieval techniques.

Sun, the leading manufacturer of Unix-based workstations, is by no means new to software. Its phenomenal growth rate and consistent successes may be due as much to software tools and system enhancements, such as the industry-standard Network File System (NFS) protocol, which it strategically cast into the public domain, as to its more bang-for-the-buck boxes. This past year has seen Sun, with the launch of its system software subsidiary, SunSoft, and its new “shrink-wrapped” operating system Solaris, preparing to lock horns with Microsoft over the future of the lucrative workstation operating system market.

Sun spun off its Unix system software development group in July of 1991 as a wholly owned subsidiary of Sun Microsystems. At that time, Sun also spun off development tools, networking, printing and some other software applications under the auspices of Sun Technical Enterprises (STE). In addition, Sun Microsystems Laboratories, Inc. (SMLI) was established as a corporate R&D organization with longer-term research objectives in both software and hardware.

Within the new “planetary” organization, most of the traditional systems and applications development continues in Mountain View, California. However, at “SMLI East,” in Chelmsford, Massachusetts, one group is delving into more esoteric areas of natural language processing and information retrieval.

SearchIt

SunSoft now offers a similar full-text retrieval utility called SearchIt with the Solaris operating system software. Like ON Location, SearchIt helps users quickly find information in large collections of files by using a full-text index of their contents.
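The idea of a full-text index can be illustrated with a minimal sketch. This is not SearchIt's implementation, which the article does not describe. It is an assumed, simplified inverted index in Python that maps each word to the files containing it, and the file names, file contents, and function names are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the documents matching the first word, then narrow down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical file names and contents, for illustration only.
documents = {
    "notes.txt": "Steam cleaning the car before rustproofing",
    "manual.txt": "Display the contents of more than one folder at a time",
    "memo.txt": "Open a folder in the file manager",
}
index = build_index(documents)
print(sorted(search(index, "folder contents")))  # prints ['manual.txt']
```

Because the index is built once, each query touches only the entries for its own words rather than rescanning every file. That is the property that makes this kind of lookup fast on large collections.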
According to Woods, “Even when you have lost track of some information and forgotten the name of the file containing it, you can usually think of a significant word or two that will retrieve it.”

SearchIt uses standard, state-of-the-art information retrieval techniques to index and retrieve any material you may have on your disk or in your network. These techniques include ranking results for estimated relevance and allowing Boolean operators and proximity operators in queries. In addition, SearchIt uses morphological variation routines developed by Woods, Hays, and Ambroziak to make searches more effective by searching for morphological variations of words instead of just specific forms. For example, if the user enters the word “find,” it will also look for “finds,” “finding,” “found,” and so on.

SearchIt is already useful, but Woods and his team are looking for ways to improve it further. Often, when searching for information, people fail to find what they need because the terms used in the query are not the same as those used by the author. This is especially true when the material being searched was not written by the person searching.

Sun's knowledge technology research group is working on techniques for indexing material by concept rather than by individual words so that variations of a concept expressed in different words can be related to each other and found efficiently. While there are systems that allow users to define concepts by manually linking and weighting synonyms and related terms, the semantic structure of language is not yet well enough understood to make it possible to automate this kind of task. The group's goal is to change that state of affairs.

Going beyond keywords

“Our experience with AnswerBook has shown that users typically encounter two kinds of results,” explains Woods. “The first, which is highly successful, is when the query is based on specific, correctly used keywords.
In this case, you know exactly what you're looking for and you can look it up directly. AnswerBook works beautifully for that.

“The second kind of result is when you don't know exactly what you are looking for or at least don't know what terms to use. In those situations, standard information retrieval techniques start to falter. It would be nice to be able to search for something without knowing the precise vocabulary used to refer to it, to be able to search for a concept rather than a specific word or phrase.

“For example, when a colleague was trying to remember how to get the Sun File Manager to display more than one folder, we searched, using the query ‘file manager, multiple windows,’ but didn't find what we wanted. The section we finally found contained the phrase ‘display the contents of more than one folder at a time.’ This is a case in which some form of conceptual retrieval might be useful. The phrase says ‘display’ instead of ‘windows’ and ‘more than one’ instead of ‘multiple.’ The ability to recognize the relationship between different phrasings that say essentially the same thing is one of the goals the group is working towards.”

In a well-ordered world, objects and ideas would be logically mapped into a tree-like hierarchy of classifications and we could move at will from specific concepts to more general concepts as well as laterally from one related meaning to another. In practice, however, items of knowledge rarely have a single unique place in such a hierarchy. This poses a substantial challenge that the knowledge technology group is actively addressing. “We're trying to develop a taxonomy for world knowledge,” explains Woods, “one in which anything you want to find can be quickly and automatically located.”

Bouncing back from abstract to concrete, Woods offers the salient example of attempting to find a company in the Yellow Pages that could steam clean his car prior to rustproofing it to resist the salt-strewn roads of the icy New England winter.
“I looked for ‘steam cleaning,’ ‘automobile steam cleaning,’ and ‘automobile cleaning’ with no success. I finally found ‘automobile washing and polishing,’ which said ‘see car washing and polishing.’ I'd like a system that knew enough about language and the meanings of words to know that car washing is a kind of automobile cleaning and then lead me directly to the relevant sections. Notice that the issue here is not just substituting an equivalent term for another. Usually one of the terms in a query is actually more general than the other and the system should know what to do about it. For example, if I ask for information about vehicles, I'd be happy to get information about cars, trucks, and buses, but if I ask for trucks, I'd be unhappy to have cars and buses included.”

Word sense

“When you compare syntax-based systems with semantics-based ones,” adds Hays, “you discover that syntactic systems just aren't adequate.” Semantics-based systems are far more difficult to build, however. Coding a general-purpose knowledge base is a Herculean effort, something no one wants to do if someone else has already done it. Hays says that they have been looking into the possibility of using an existing resource, such as WordNet, an eighty-thousand-word network of English nouns, verbs, adjectives, and adverbs and the relationships between them, developed by George Miller of Princeton University, a psychologist who has been studying the relationships between words and their meanings for many years.

“A problem with natural language interfaces,” philosophizes Woods, “is that natural language has evolved for communication between intelligent creatures. Since databases by themselves are stupid, you miss most of the benefit when you try to talk to them in English. Natural language interfaces will become truly useful when we have smart agents that can understand what we want and interact with databases for us, applying knowledge and exercising common sense in the process.
This is especially true when we're working with natural language data or when a database consists largely of free-text fields. Increasingly, computers are being used to store and access knowledge expressed in natural language and it is becoming increasingly apparent that we need new techniques to help people deal with this information. Developing the kind of technology it takes to do this is what the knowledge technology group at Sun is about.”

Sun Microsystems Laboratories, Inc., 2 Elizabeth Drive, Chelmsford, MA 01824-4195, USA; Tel +1-508-442-0468