Automatic Information Refining: a Reuters Success Story

By Andrew Joscelyne


This article originally appeared in the May/Jun 1991 issue of Language Industry Monitor

In September 1989, Reuters Ltd, the world’s leading electronic publisher for the financial and business community, introduced its Topic Identification System (TIS) in its UK offices. TIS automatically classifies incoming news articles for the company’s online database, Country Reports.

Originally called CONSTRUE (Categorization of News STories, Rapidly, Uniformly and Extensibly), this AI-based system was developed by the Carnegie Group (formerly Language Craft) in Pittsburgh. TIS was designed to improve the throughput of story classification, not only by automating activities previously done manually, but also by going beyond the imprecision of conventional keyword search techniques.
    Although it is not, strictly speaking, a natural language processing rig, TIS nonetheless handles a crucial info-tracking task using a text-sensitive expert system. According to Reuters Special Projects Manager Stephen Weinstein, the system currently classifies stories into 674 distinct categories (eg, currency exchange, commodities, people, countries, company names). It takes an average of 4.36 seconds to process a story, with a mean accuracy of nearly 90%, thus already exceeding the original project specifications.
    Before turning to TIS, Reuters had more than a hundred university graduates in its Finsbury Data Services office categorizing the stories manually, with a 30% staff turnover annually, an 85% accuracy rate of classification, and a hefty payroll. TIS saved Reuters some $752,000 in 1990 and should mean an annual reduction in costs of over $1.2 million in 1991. What took six days in the past, the system can now do in two minutes. TIS, it seems, is delivering the goods.
    Reuters’ story classification involves non-obvious categories. Information about dollar-to-yen evaluation, for example, must not be classified into the same category as a yen-to-dollar evaluation story. Since Reuters positions itself in a market where added value in information services means providing highly targeted knowledge to time-conscious end users, any classification system based on standard keyword searches would return too much irrelevant information, thus requiring further ‘refining’ by the client. TIS was therefore developed to add value to information retrieval by sorting stories into the right topical fields, rather than categorizing them on the basis of simple word lists.
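The limitation of plain keyword search described above can be illustrated with a small Python sketch. It is not taken from TIS: the function names, the sample headlines, and the regular expression are all assumptions made for illustration. An order-blind keyword test treats both headlines alike, while a crude order-sensitive pattern tells them apart.

```python
import re

def keyword_match(text, keywords):
    """Order-blind keyword search: true if every keyword occurs somewhere."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return keywords <= words

def directional_category(text):
    """Hypothetical order-sensitive match: which currency is mentioned first.

    Returns None when two different currencies are not found in sequence.
    """
    m = re.search(r"\b(dollar|yen)\b.*?\b(dollar|yen)\b", text.lower())
    if m and m.group(1) != m.group(2):
        return f"{m.group(1)}-to-{m.group(2)}"
    return None

# Illustrative headlines, not real Reuters stories.
story_a = "Dollar rises against the yen in Tokyo"
story_b = "Yen rises against the dollar in Tokyo"

# Keyword search cannot separate the two stories.
same_for_both = (keyword_match(story_a, {"dollar", "yen"})
                 and keyword_match(story_b, {"dollar", "yen"}))

# The order-sensitive pattern assigns them to different categories.
category_a = directional_category(story_a)
category_b = directional_category(story_b)
```

A real system would need far richer patterns than this single regular expression; the point is only that word order carries information that a bag of keywords discards.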
    Initially, the TIS engine makes a concept recognition pass using weighted word and phrase arrays that define a concept based on synonyms and related terms. Then, the system applies a set of local categorization (if-then) rules to refine the classification further. The first process might allocate a ‘dollar’ story to the right financial category; the second might apply a default rule that ‘dollar’ is US currency unless there is a ‘country’ match (eg, Australia). Rule-base refinement and consistency checking, concept editing, and other housekeeping chores are handled by non-technical staff at Reuters on the TIS workbench (based on Lucid Common Lisp), which requires minimal technical expertise.
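The two passes described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the CONSTRUE or TIS implementation (which the article says is built around Lucid Common Lisp): the concept names, weights, threshold, and category labels are all hypothetical.

```python
# Hypothetical concept definitions: each concept is a weighted set of
# words and phrases. The weights and threshold are assumed values.
CONCEPTS = {
    "currency": {"dollar": 1.0, "yen": 1.0, "exchange rate": 1.0, "central bank": 0.5},
    "australia": {"australia": 1.0, "australian": 1.0, "sydney": 0.5},
}
THRESHOLD = 1.0

def recognise_concepts(text):
    """Pass 1: score each concept by summing the weights of matching patterns."""
    lowered = text.lower()
    scores = {}
    for concept, patterns in CONCEPTS.items():
        score = sum(weight * lowered.count(pattern)
                    for pattern, weight in patterns.items())
        if score >= THRESHOLD:
            scores[concept] = score
    return scores

def apply_rules(text, concepts):
    """Pass 2: if-then rules refine the concepts into final categories."""
    categories = set()
    if "currency" in concepts and "dollar" in text.lower():
        # Default rule: 'dollar' means US currency unless a country concept matched.
        if "australia" in concepts:
            categories.add("australian-dollar")
        else:
            categories.add("us-dollar")
    return categories

def classify(text):
    """Run both passes on one story."""
    return apply_rules(text, recognise_concepts(text))

us_story = classify("The dollar fell as the central bank cut rates")
au_story = classify("The Australian dollar fell in Sydney trading")
```

Keeping the weighted concept tables separate from the rules mirrors the division of labour the article describes, where concepts and rules are edited independently on a workbench.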
    Following the successful implementation of TIS in the Country Reports database, Reuters began using TIS in December 1990 to process Textline, its global full-text database. This has meant the company can offer an even wider range of English language news stories. If the evaluation is positive, the next step will be to apply the classification engine to non-Reuters sources, scanning a range of electronic feeds to provide an ultra-rapid real-time info service to business users. According to Stephen Weinstein, this will entail a radical upgrading of the rule-base. Meanwhile, Carnegie Group’s original CONSTRUE engine has been generalized into a Text Categorization Shell (TCS), now available for developing other applications to classify full text or generate documents from indexes.
    In the long run, Reuters may add further value to its information service by adding functionality to the story transmission system. Weinstein suggests the possibility of tagging stories with SGML codes to allow end-users to further process downloaded texts. He stresses the costs involved in adding any such standard document structure mark-up codes; such a shift would have to reflect substantial market demand to make the additional effort worthwhile.
    Its success with TIS in automating one process in the data flow has led Reuters to consider information technology to smarten up other key activities. Translation, for example, from sixteen languages into English is a major source of news stories for the service, and plans are under way to develop an interactive translator’s workstation. Another potential for language processing is that of information packaging: could you shape into story form facts plucked rapidly from databases using an expert system? The challenge, in other words, will be to develop a rule-based, facts-to-text interpreter to provide further services for the information-hungry business world.

Reuters Ltd, 85 Fleet Street, London EC4P 4AJ, UK; +44 71 250 1122
Carnegie Group Inc., 5 PPG Place, Pittsburgh, PA 15217, USA

COPYRIGHT © 1991 BY LANGUAGE INDUSTRY MONITOR
