THE CORE LEXICAL ENGINE:
THE CONTEXTUAL DETERMINATION OF WORD SENSE

James Pustejovsky

Computer Science Department
Brandeis University
258 Volen Center
Waltham, MA 02254

CONTACT INFORMATION

Prof. James Pustejovsky, Computer Science Department, Brandeis University, 258 Volen Center, Waltham, MA 02254

email: jamesp@cs.brandeis.edu

phone: (617) 736-2709

fax: (617) 736-2741

WWW PAGE

http://cs.brandeis.edu/~jamesp/projects/corelex.html

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

lexicon development, natural language processing, linguistic knowledge acquisition, lexical semantics

PROJECT SUMMARY

Our goal in this research is the robust contextual determination of word sense for natural language applications; this involves two closely related subgoals.

The Core Lexical Engine itself consists of three major components.

The theoretical aspects of the work have focused on consolidating these three components into a scalable formalization.

This research aims to demonstrate the applicability of a model of lexical knowledge to the task of semi-automatically constructing a core lexical engine, making extensive use of text corpora. The information obtained from machine-readable dictionaries and from large text corpora is equally rich, representative, and important for populating a lexical knowledge base for natural language processing. We are developing a framework for semi-automated lexical acquisition that makes complementary use of both types of lexical resources.

The thrust of the project during the past year has been both the theoretical foundations of the lexical representation language used in the project, Generative Lexicon (GL) Theory (Pustejovsky, 1991, 1995), and the development of acquisition tools for data mining over corpora. How the theoretical work ties in with strategies for automatic lexical acquisition from ``closed'' text corpora is also a central component of our efforts.
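To make the representational assumptions concrete, the following sketch (in Python, purely for illustration) shows the general shape of a GL-style lexical entry with its qualia structure. The class layout and the particular values given for ``novel'' are simplifications introduced here for exposition, not the project's actual representation language.

# A minimal, illustrative sketch of a Generative Lexicon entry.
# The field names follow GL's qualia structure (formal, constitutive,
# telic, agentive); the concrete values and class layout are assumptions
# made for illustration only.
from dataclasses import dataclass, field


@dataclass
class Qualia:
    formal: str         # what kind of thing it is (its place in a type hierarchy)
    constitutive: str   # what it is made of / its parts
    telic: str          # its purpose or typical function
    agentive: str       # how it comes into being


@dataclass
class GLEntry:
    lemma: str
    arg_structure: list = field(default_factory=list)
    event_structure: list = field(default_factory=list)
    qualia: Qualia = None


# The stock GL example: "novel" as a physical object / information unit
# whose telic role is reading and whose agentive role is writing.
novel = GLEntry(
    lemma="novel",
    arg_structure=["x: phys_obj . info"],
    qualia=Qualia(
        formal="book(x)",
        constitutive="narrative(x)",
        telic="read(e, y, x)",
        agentive="write(e, z, x)",
    ),
)

Under such an entry, a coercing context such as ``begin a novel'' can recover an event reading from the telic (read) or agentive (write) role, rather than requiring a separately listed sense.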

The project is now one and a half years underway, and we have completed the collection of corpora and the integration of acquisition and analysis tools. One of Dr. Pustejovsky's graduate students, Scott Waterman, began the first Apple Summer Internship under Dr. Boguraev at Apple. The postdoctoral researcher for the project, Dr. Michael Johnston (from U.C. Santa Cruz), first worked as an intern at Apple with Dr. Boguraev and is now working as a postdoctoral researcher for Dr. Pustejovsky at Brandeis.

PROJECT REFERENCES

Boguraev, B. and J. Pustejovsky (1996) Corpus Processing for Lexical Acquisition, MIT Press, Cambridge.

Pustejovsky, J. (1995) The Generative Lexicon, MIT Press, Cambridge.

Pustejovsky, J. (1996) "Lexical Underspecification for Semantic Forms," Folia Linguistica.

Pustejovsky, J. (1995) "Constraints on Type Coercion," in P. Saint-Dizier and E. Viegas (eds.), Computational Lexical Semantics, Cambridge University Press.

Pustejovsky, J. and P. Bouillon (1995) "Aspectual Coercion and Logical Polysemy," Journal of Semantics.

Pustejovsky, J. and B. Boguraev (1995) "Lexical Semantics in Context," Journal of Semantics.

Johnston, M., B. Boguraev, and J. Pustejovsky (1995) "The Acquisition and Interpretation of Complex Nominals," in Working Notes of the AAAI Spring Symposium on the Representation and Acquisition of Lexical Knowledge, AAAI, Stanford.

Pustejovsky, J. (1993) Semantics and the Lexicon, Kluwer Academic Publishers, Dordrecht.

Pustejovsky, J., S. Bergler, and P. Anick (1993) "Lexical Semantic Techniques for Corpus Analysis," Computational Linguistics, Special Issue on Corpus Linguistics, 19.2.

Pustejovsky, J. and B. Boguraev (1993) "Lexical Knowledge Representation and Natural Language," Artificial Intelligence.

AREA BACKGROUND

The two areas most closely related to this project are (1) lexical knowledge representation and disambiguation, and (2) knowledge acquisition from corpora.

Concerning area (1): One of the most pervasive phenomena in natural language is lexical ambiguity, a problem for language learners and natural language processing systems alike. The notion of context enforcing a certain reading of a word, traditionally viewed as selecting a particular word sense, is central both to global lexical knowledge base design (the issue of breaking a word into word senses) and to the local composition of individual sense definitions. Most computational lexicon designs still reflect the `static' approach to this problem first proposed in the 1960s. Under this design, the number of senses within an entry, and the distinctions between them, are `frozen' into the system's lexicon, and definitions make no provision for the possibility that boundaries between word senses may shift with context.

Recently, we have realized that there are serious problems with positing a fixed number of `bounded' word senses for lexical items. In a framework which assumes a partitioning of the space of possible uses of a word into word senses, the problem becomes that of selecting, on the basis of various contextual factors (e.g., selectional restrictions), the word sense closest to the use of the word in the given text. As far as a language user is concerned, the question is that of `fuzzy matching' of contexts; as far as a text analysis system is concerned, this reduces to a search within a finite space of possibilities.
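The following toy sketch (Python, for illustration only) makes the `static' procedure explicit: a fixed inventory of senses per word, with disambiguation reduced to a search for the listed sense whose selectional restrictions best overlap the context. The inventory and the context features are invented for the example, not drawn from any actual system lexicon.

# A toy illustration of 'static' sense selection: each word carries a
# fixed inventory of senses, and disambiguation searches for the sense
# whose selectional restrictions best match the surrounding context.
SENSES = {
    "bank": [
        {"sense": "bank/financial-institution",
         "restrictions": {"money", "account", "loan", "deposit"}},
        {"sense": "bank/river-edge",
         "restrictions": {"river", "water", "fishing"}},
    ],
}


def select_sense(word, context_words):
    """Pick the listed sense whose restrictions overlap the context most."""
    best, best_score = None, -1
    for entry in SENSES.get(word, []):
        score = len(entry["restrictions"] & set(context_words))
        if score > best_score:
            best, best_score = entry["sense"], score
    return best


print(select_sense("bank", ["open", "an", "account", "with", "money"]))
# -> 'bank/financial-institution'

With longer phrases made up of mutually ambiguous words, this search multiplies combinatorially, which is part of the computational objection raised below.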

This approach fails on several accounts, both in terms of what information a lexicon makes available for driving the disambiguation process and in terms of how a sense selection procedure uses that information. External contextual factors alone are typically not sufficient for precise selection of a word sense, and the lexical entry itself often does not provide reliable enough pointers to discriminate critically among word senses. In automated sense selection, the search process becomes computationally expensive, particularly when it must account for longer phrases made up of individually ambiguous words. Finally, the assumption that the different uses of a word can be exhaustively listed lacks the explanatory power needed to make generalizations and predictions about how novel uses of a word can be reconciled with its existing lexical definitions. It is this point that forces us to do more than simply tag the words of the language with semantic markers.

Concerning area (2): In computational linguistics research, it has become clear that, regardless of a system's sophistication or breadth, its performance must be measured in large part by the resources provided by its computational lexicon. The fundamental resources that go into a lexical item enable a wide range of morphological, lexical, syntactic, semantic, and pragmatic processes to operate in the course of tasks such as language analysis, text processing, content extraction, document summarization, and machine translation. Lexical acquisition has therefore emerged as an essential phase in getting any realistic natural language processing system off the ground. This work began approximately 10-13 years ago with research aimed at leveraging the information compiled by lexicographers, which was then becoming available in the form of machine-readable dictionaries. We are now at a point where it is clear that, even with the lexical data available in those resources, a large number of word classes remain outside the coverage of a ``conventional'' dictionary; furthermore, some of the information required by current computational systems is simply unavailable in machine-readable dictionaries.

The fundamental problem of lexical acquisition, then, is how to provide systems, fully and adequately, with the lexical knowledge they need to operate efficiently. The answer toward which the community is converging today is to extract the lexicon from texts themselves.

It has always been trivial to generate word lists by extracting isolated word forms from a text. Corpora with these forms tagged and syntactically annotated then became available, providing the basis for a new class of 'stochastic' parsers. Corpus processing techniques have demonstrated, among other things, how certain categories of lexical properties can be identified through empirical study of word occurrences in large bodies of text. For instance, paired corpora in two languages provide evidence that a lexicon can be induced from the 'alignment' of texts which are translations of each other; word collocations can be distinguished from coincidental co-occurrences; semantic analysis of phrasal segments points to regular behavior of certain word classes; and analysis of patterns in parsed (or otherwise structurally annotated) texts reveals the potential for deducing semantic information about lexical classes.
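As one concrete example of such techniques, the sketch below (Python, illustrative only) scores adjacent word pairs by pointwise mutual information to separate genuine collocations from coincidental co-occurrences; the tokenized corpus and the frequency cutoff are placeholders, not the project's actual acquisition tools.

# A small sketch of one corpus-processing technique mentioned above:
# ranking adjacent word pairs by pointwise mutual information (PMI)
# so that genuine collocations stand out from chance co-occurrences.
import math
from collections import Counter


def pmi_collocations(tokens, min_count=5):
    """Return adjacent word pairs ranked by PMI, ignoring rare pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (w1, w2), n in bigrams.items():
        if n < min_count:
            continue
        # PMI = log2( P(w1,w2) / (P(w1) * P(w2)) )
        pmi = math.log2((n / total) /
                        ((unigrams[w1] / total) * (unigrams[w2] / total)))
        scored.append(((w1, w2), pmi))
    return sorted(scored, key=lambda item: item[1], reverse=True)

Run over a large tokenized corpus, the highest-scoring pairs are candidate collocations for further lexicographic review.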

AREA REFERENCES

Boguraev, B. and J. Pustejovsky (1994) "A Richer Characterization of Dictionary Entries," in B. Atkins and A. Zampolli (eds.), Automating the Lexicon, Oxford University Press.

Boguraev, B. and J. Pustejovsky (1996) Corpus Processing for Lexical Acquisition, MIT Press, Cambridge.

Computational Linguistics (1993) Special Issues on Corpus Linguistics.

Miller, G. (1990) "WordNet: An On-line Lexical Database," International Journal of Lexicography, 3, 235-312.

Miller, G. (1991) The Science of Words, Scientific American Press.

Pustejovsky, J. (1995) The Generative Lexicon, MIT Press, Cambridge.

RELATED PROGRAM AREAS

Adaptive Human Interfaces, Usability and User-Centered Design.