This is a preliminary report on a system for word sense disambiguation (WSD) for unrestricted vocabulary, which requires no training on tagged text. Words are disambiguated to WordNet senses. The “disambiguating power” of the system comes from three sources: (A) parsing by English Slot Grammar (ESG), (B) the WordNet relation system, and (C) the WordNet sense frequency data. At training time, ESG is used to parse a large training corpus, producing a database of head-slot-filler tuples with frequencies; these tuples serve as local contexts in the WSD. Using these slot-based local contexts, along with WordNet relations, we form the sense discriminator database for the corpus, which assigns to each slot-based local context of a word w, and each WordNet sense s of w, an evidence number for sense s of w based on that context. At runtime, each sentence of a text is parsed with ESG, and the following three things are done: (1) The slot-filler relations given by the parse are converted into keys for the sense discriminator database, and look-up in that database allows us to accumulate contextual evidence for each possible sense of each word. (2) The possible senses of each word are filtered through ESG's morphological analysis (especially for parts of speech). Also, ESG's analysis of phrasal verbs plays a role in deciding which WordNet entries to look up. (3) The WordNet sense frequency data (for the word forms and parts of speech determined by ESG) feed into a WSD score that also uses the contextual evidence for senses. We describe a preliminary experiment and evaluation of this WSD system, in which the accuracy was 72%. A secondary purpose of the report is to describe a kind of integration of WordNet with ESG, where ESG parse structures show the WordNet senses chosen by the WSD system.
This framework includes a new system of very fast API functions for WordNet, along with an efficient editor-based WordNet browser that allows easy exploration of WordNet by clicking on synsets or words in the result displays of queries.
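The runtime steps (1) to (3) described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the key layout `(word, slot, other_word)`, the sense labels, the evidence numbers, the frequency counts, and in particular the way the frequency prior and the contextual evidence are combined (a smoothed log prior plus summed evidence). The abstract does not specify the report's actual scoring formula, so this should not be read as the system's method.

```python
import math
from collections import defaultdict

# Hypothetical sense discriminator database (illustrative values only):
# (word, slot, other_word) -> {sense: evidence number}
SENSE_DISCRIMINATOR_DB = {
    ("bank", "obj_of", "rob"): {"bank.n.financial": 3.0, "bank.n.river": 0.2},
    ("bank", "nadj", "savings"): {"bank.n.financial": 2.5},
}

# Hypothetical sense frequency counts, keyed by (word, part of speech).
# Keying by part of speech stands in for the filtering of step (2).
SENSE_FREQUENCY = {
    ("bank", "noun"): {"bank.n.financial": 20, "bank.n.river": 25},
}


def accumulate_evidence(word, contexts, db):
    """Step (1): turn slot-filler relations into keys and sum the evidence."""
    evidence = defaultdict(float)
    for slot, other_word in contexts:
        for sense, value in db.get((word, slot, other_word), {}).items():
            evidence[sense] += value
    return evidence


def score_senses(word, pos, contexts, db, freqs, prior_weight=1.0):
    """Step (3): combine a frequency prior with contextual evidence.

    The combination used here (add-one smoothed log prior plus summed
    evidence) is an assumption, not the formula from the report.
    """
    counts = freqs.get((word, pos), {})
    total = sum(counts.values())
    evidence = accumulate_evidence(word, contexts, db)
    scores = {}
    for sense, count in counts.items():
        prior = math.log((count + 1) / (total + len(counts)))
        scores[sense] = prior_weight * prior + evidence.get(sense, 0.0)
    return scores


def choose_sense(word, pos, contexts, db, freqs):
    """Return the highest-scoring sense, or None if no sense is available."""
    scores = score_senses(word, pos, contexts, db, freqs)
    return max(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    parsed_contexts = [("obj_of", "rob"), ("nadj", "savings")]
    print(choose_sense("bank", "noun", parsed_contexts,
                       SENSE_DISCRIMINATOR_DB, SENSE_FREQUENCY))
```

In this toy setup the frequency data alone would favor the more frequent sense, while the accumulated contextual evidence can override that default, which is the interplay between sources (A), (B), and (C) that the abstract describes.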
By: Michael C. McCord
Published in: RC23397 in 2004
LIMITED DISTRIBUTION NOTICE:
This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).