Viewpoint: The case for natural numerics


This article originally appeared in the May-June 1992 issue of Language Industry Monitor

“You wouldn’t believe the horrors of what people have done with their databases. Storing dates as strings, for example. That makes it virtually impossible to calculate date ranges” (Jerrold Ginsparg, chief scientist at NLI, Berkeley, quoted in LIM no. 8). Steeped in the tradition of good programming practice, our first reaction might be to agree wholeheartedly. On second thought, however, this remark suggests that our computers are unable to perform any numeric computation if the operands are embedded in a string of text.
    These days, with almost weekly TV documentaries on neural networks, AI, and virtual reality, it is becoming increasingly difficult to explain to an educated layperson why we cannot teach computers to handle free-format calendar dates. The truth, of course, is that we can. So why on earth haven’t we done it yet?
    First of all, it is harder than you might suspect. Free-format information brings with it a fuzziness that we are not yet quite used to in the software industry. Explicit rules and data types have been our guide rails for decades. Without them, things can get perilous. It is telling that halfway through the 1980s no less a figure than Roger Schank, known for his epoch-making AI research on conceptual dependency at Yale, was called upon to solve a very mundane problem: decoding all the free-format money transfer orders a Brussels bank was receiving by telex every day. In other words, problems like this may well exceed the intellectual capacity of the average software development shop.
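
To make the opening claim concrete, here is a minimal sketch in Python of calculating a date range from dates that are embedded in a string of text. Everything in it is an assumption made for illustration: the function names, the sample sentence, and above all the restriction to two English date shapes ("12 May 1992" and "May 12, 1992"). It says nothing about abbreviations, other languages, or all-numeric dates, which is where the real difficulty starts.

```python
import re
from datetime import date

# Month names recognised by this sketch (English, written out in full only).
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Two illustrative patterns only: "12 May 1992" and "May 12, 1992".
DAY_FIRST = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b")
MONTH_FIRST = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2}),\s*(\d{4})\b")


def _make_date(year, month_name, day):
    """Return a date, or None if the month name or day is not valid."""
    month = MONTHS.get(month_name.lower())
    if month is None:
        return None
    try:
        return date(int(year), month, int(day))
    except ValueError:  # e.g. "31 June 1992"
        return None


def find_dates(text):
    """Return the dates found in text, in order of appearance."""
    found = []
    for m in DAY_FIRST.finditer(text):
        found.append((m.start(), _make_date(m.group(3), m.group(2), m.group(1))))
    for m in MONTH_FIRST.finditer(text):
        found.append((m.start(), _make_date(m.group(3), m.group(1), m.group(2))))
    valid = [(pos, d) for pos, d in found if d is not None]
    return [d for pos, d in sorted(valid)]


def range_in_days(text):
    """Days between the first and last date mentioned in text."""
    dates = find_dates(text)
    if len(dates) < 2:
        raise ValueError("need at least two recognisable dates")
    return (dates[-1] - dates[0]).days


print(range_in_days("The contract runs from 12 May 1992 to June 30, 1992."))
```

The arithmetic itself is trivial once the dates are found; the patterns are the fragile part, and each new date shape needs either a new rule or a decision about what to do when rules conflict.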

Extralinguistic phenomena
However, attention from scholars and scientists to these mundane problems is only sporadic (despite the fact that they may be well paid for it!). In the leading university centers, calculating date ranges surely does not sound highbrow enough for PhD students. At COLING ’88, Carnegie Mellon’s Masaru Tomita shrewdly observed that computational linguists tend to direct their energy to theoretically interesting aspects of “linguistic” sentences of the John-hit-Mary type rather than to the practical importance of “real” language in which, among other things, date and time expressions prevail.
    Has robust identification and handling of extralinguistic text elements (numerals, punctuation, parentheses, special signs, etc.) fallen between two stools?
    All technological signposts point in the direction of free-format entry of information. Naturally, this includes textual fragments in which numeric information such as account numbers, inventory data, addresses, etc. is embedded. It would be insane to accommodate natural language input, i.e., flexible vocabulary and word order, if at the same time we kept insisting on separate and fixed-format numerics.
    Meanwhile, developers of commercial MT systems have learned this lesson, being forced to equip their systems with the requisite preprocessing facilities. And more recently, as part of its Eurotra follow-up program (“pre-competitive application prototyping”), the European Commission contracted a consortium led by the Sema Group Belgium to focus on so-called low-level, sub-linguistic phenomena. This project, the ET-6/3 Text Handling Design Study, addresses the recognition of logical document structure (largely inspired by SGML) and layout characteristics, but also concerns the identification of such constructs as numbers, calendar dates, acronyms, personal names, and geographical names.
    However, extralinguistic textual technology has a broader application potential than MT or even classic NLP. As text has entered the mainstream of computing (as Will Manis asserts in LIM no. 7), new operating-system functions and software agents will be required to look inside the textual medium. Word processors do this now only to a very limited extent (wordwrap, hyphenation, and spellchecking). Innovative human-computer interfaces will need to constantly monitor the user’s activities, in order to help her or him by taking over boring routine tasks (numbering, reformatting, and rearranging pieces of information), signalling apparent mistakes, etc.

No numerical knowledge
Though today’s word processors often have a calculate facility for tabular data, and although they’re getting more and more integrated with spreadsheet and presentation packages, their knowledge of mundane numerical things is still in a deplorable state. When you are using your word processor, your computer hardly “knows” that 3 follows 2, that 1b follows 1a, that IX precedes X, or that a telephone number cannot have 500 digits, to name a few things. Its knowledge about counting is, at best, “compartmentalized,” available only when it is explicitly instructed to add or to subtract. Similarly, its knowledge about calendar dates is not operative when it is scanning your input text stream.
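
As an illustration of the kind of counting knowledge meant here, the following Python sketch computes the label that follows a given one, covering the three examples above (2 and 3, 1a and 1b, IX and X). The function names and the set of supported label styles are assumptions made for this sketch, not a description of any existing word processor.

```python
import re

ROMAN_VALUES = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(number):
    """Convert a positive integer to an upper-case Roman numeral."""
    parts = []
    for value, symbol in ROMAN_VALUES:
        while number >= value:
            parts.append(symbol)
            number -= value
    return "".join(parts)


def from_roman(text):
    """Convert a well-formed upper-case Roman numeral to an integer."""
    total, pos = 0, 0
    for value, symbol in ROMAN_VALUES:
        while text.startswith(symbol, pos):
            total += value
            pos += len(symbol)
    # Reject empty input and non-canonical forms such as "IIII".
    if not text or pos != len(text) or to_roman(total) != text:
        raise ValueError("not a Roman numeral: %r" % text)
    return total


def next_label(label):
    """Return the label that follows label in its own numbering style."""
    if re.fullmatch(r"\d+", label):                 # 2 -> 3
        return str(int(label) + 1)
    m = re.fullmatch(r"(\d+)([a-y])", label)        # 1a -> 1b
    if m:
        return m.group(1) + chr(ord(m.group(2)) + 1)
    if re.fullmatch(r"[IVXLCDM]+", label):          # IX -> X
        return to_roman(from_roman(label) + 1)
    raise ValueError("unknown numbering style: %r" % label)


def follows(later, earlier):
    """True if later is the immediate successor of earlier."""
    return next_label(earlier) == later


print(next_label("2"), next_label("1a"), next_label("IX"))
```

Even this small sketch has to take sides: it reads a lone "I" as a Roman numeral rather than as a letter, a choice that only the surrounding text could justify.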
    The trouble with decoding free-format information, whether dates, money transfer orders or whatever, is that there’s always a percentage of cases that can’t be handled fully automatically. Put differently: disambiguation isn’t always possible by means of the surrounding context and the machine’s knowledge alone. Sometimes, the human user must be consulted via the human-computer interface. This requirement complicates the construction of intelligent software: it is the strength but also the weakness of the prospective software agents, and may well have prevented their successful deployment until now.
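
The point about consulting the user can be sketched as follows, again in Python and again under assumptions made for illustration only: the input is an all-numeric date of the form a/b/yyyy, and ask_user is a hypothetical callback standing in for the human-computer interface. The sketch resolves the date on its own when only one reading is a valid calendar date and hands the choice to the user otherwise.

```python
from datetime import date


def interpretations(text):
    """Return every valid calendar date that 'a/b/yyyy' could denote."""
    a, b, year = (int(part) for part in text.split("/"))
    candidates = set()
    # Try both day/month and month/day readings.
    for day, month in ((a, b), (b, a)):
        try:
            candidates.add(date(year, month, day))
        except ValueError:
            pass  # this reading is not a real calendar date
    return sorted(candidates)


def resolve(text, ask_user):
    """Decode text, consulting ask_user only when it is ambiguous.

    ask_user is a hypothetical callback: it receives the original text
    and the list of candidate dates, and returns the one the user picks.
    """
    options = interpretations(text)
    if not options:
        raise ValueError("no valid reading of %r" % text)
    if len(options) == 1:
        return options[0]
    return ask_user(text, options)


def pick_first(text, options):
    """Stand-in for a real dialogue: always choose the first candidate."""
    return options[0]


print(resolve("25/12/1992", pick_first))       # only one valid reading
print(len(interpretations("3/4/1992")))        # two readings: user must decide
```

A fuller version would first try to narrow the candidates from context, such as other dates in the same document, and would fall back on the user only when that fails.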
    There are several factors giving hope that a “horror-less” future for NLP will not be too distant. The most important development perhaps is SGML, which finally paid tribute to the logical structure present in every document. Throughout the middle ages preceding it, computers treated text as just one long character string (its maximum length being solemnly specified as 65,536 bytes or so in software manuals). Furthermore, the general trend toward object-oriented technology may contribute to appropriate handling of extralinguistic textual elements. And finally, increased attention can be expected from the human-computer interface community, which represents a rich variety of disciplines and trades and consequently offers fresh new attitudes and approaches compared to those of seasoned computer programmers or computational linguists.
    As for the latter: unless they turn themselves into language engineers or text technologists and no longer resist dealing with the theoretically “uninteresting” aspects of real sentences, they will gradually isolate themselves. Accepting that researchers and experimenters from other disciplines are already making inroads on language technology at the sub-linguistic level, computational linguists should focus on similarly pragmatic pattern recognition functions at the (lower) linguistic levels. Otherwise, they may lose their linguistic grounds too.

— Tony Whitecomb

Tony Whitecomb was recently appointed part-time professor of cognitive ergonomics at the Technical University of Delft.

COPYRIGHT © 1992 BY LANGUAGE INDUSTRY MONITOR
