Paraphrases are textual expressions that convey the same meaning using different surface forms. Capturing the variability of language, they play an important role in many natural language applications including question answering, machine translation, and multi-document summarization. In linguistics, paraphrases are characterized by approximate conceptual equivalence. Since no automated semantic interpretation systems available today can identify conceptual equivalence, paraphrases are difficult to acquire without human effort. The aim of this thesis is to develop methods for automatically acquiring and filtering phrase-level paraphrases using a monolingual corpus.
Noting that real-world language uses far more quasi-paraphrases than logically equivalent ones, we first present a general typology of quasi-paraphrases together with their relative frequencies; to our knowledge, this is the first such typology. We then present a method for automatically learning the contexts in which quasi-paraphrases obtained from a corpus are mutually replaceable. For this purpose, we use Relational Selectional Preferences (RSPs), which specify the selectional preferences of the syntactic arguments of phrases (usually verbs or verb phrases). From the RSPs of individual phrases, we learn Inferential Selectional Preferences (ISPs), which specify the selectional preferences of a pair of quasi-paraphrases. We then apply the learned ISPs to the task of filtering incorrect inferences, achieving an accuracy of 59%, a statistically significant improvement over several baselines.
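The filtering idea can be illustrated with a toy sketch. All names, phrases, and semantic classes below are invented for illustration; this is not the thesis's actual implementation, only a minimal rendering of the intuition that an ISP is the intersection of two phrases' RSPs, and that an inference is kept only when the arguments' semantic classes satisfy that intersection.

```python
# Hypothetical sketch of ISP-based inference filtering.
# RSPs: the semantic classes each phrase accepts in its subject (X)
# and object (Y) slots. These entries are invented for the example.
RSP = {
    "X acquire Y": {"X": {"Company", "Person"}, "Y": {"Company", "Product"}},
    "X buy Y":     {"X": {"Company", "Person"}, "Y": {"Company", "Product", "Food"}},
}

def inferential_preferences(p1, p2):
    """ISP of a paraphrase pair: the classes admissible in both phrases' slots."""
    return {slot: RSP[p1][slot] & RSP[p2][slot] for slot in ("X", "Y")}

def accept_inference(p1, p2, x_class, y_class):
    """Keep the inference 'p1 => p2' only if both argument classes satisfy the ISP."""
    isp = inferential_preferences(p1, p2)
    return x_class in isp["X"] and y_class in isp["Y"]

print(accept_inference("X acquire Y", "X buy Y", "Company", "Company"))  # True
print(accept_inference("X acquire Y", "X buy Y", "Company", "Food"))     # False
```

Here "Google acquired YouTube => Google bought YouTube" passes, while an object of class Food (sensible for *buy* but not for *acquire* in this toy RSP table) is rejected.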
Since quasi-paraphrases are often inexact, carrying semantic implications that may hold in only one direction, we present an algorithm, LEDIR, that learns the directionality of quasi-paraphrases using the (syntactic-argument-based) RSPs of phrases. Learning directionality allows us to distinguish strong (bidirectional) paraphrases from weak (unidirectional) ones. We show that the directionality of quasi-paraphrases can be learned with 48% accuracy, again a statistically significant improvement over several baselines.
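The directionality intuition can be sketched as follows: if the semantic classes one phrase selects for are (roughly) a subset of those the other selects for, the inference plausibly holds only from the narrower phrase to the broader one. The threshold and class sets below are made up for the example; LEDIR's actual decision procedure differs in detail.

```python
# Simplified, illustrative LEDIR-style directionality check.
def direction(classes_p1, classes_p2, threshold=0.8):
    """Classify a paraphrase pair by the overlap of the semantic classes
    their arguments select for (their RSPs)."""
    overlap = len(classes_p1 & classes_p2)
    r1 = overlap / len(classes_p1)   # fraction of p1's classes shared with p2
    r2 = overlap / len(classes_p2)   # fraction of p2's classes shared with p1
    if r1 >= threshold and r2 >= threshold:
        return "bidirectional"       # strong paraphrase: p1 <=> p2
    if r1 >= threshold:
        return "p1 => p2"            # p1's contexts are a subset of p2's
    if r2 >= threshold:
        return "p2 => p1"
    return "no inference"

# "X devour Y" selects a narrow class of objects; "X eat Y" a broader one,
# so the inference runs from devour to eat but not back.
devour = {"Food"}
eat = {"Food", "Time", "Resource"}
print(direction(devour, eat))  # p1 => p2
```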
In learning the context and directionality of quasi-paraphrases, we encounter the need for semantic concepts: both RSPs and ISPs are defined in terms of them. To learn these semantic concepts from text, we use the semi-supervised clustering algorithm HMRF-KMeans, which we show substantially outperforms the commonly used unsupervised clustering approach. Applying semi-supervised clustering to the task of discovering verb classes, we obtain precision scores of 54% and 37%, with corresponding recall scores of 53% and 38%, on our two test sets. These are large improvements over the baseline.
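The core idea of HMRF-KMeans is to bias k-means assignments with pairwise supervision: must-link pairs should share a cluster and cannot-link pairs should not. The toy assignment step below only penalizes violated constraints; the real algorithm also learns the distance metric and iterates assignment with center updates. All data and the penalty weight are invented for illustration.

```python
# Minimal sketch of a constraint-penalized k-means assignment step,
# in the spirit of HMRF-KMeans (toy version, illustration only).
import math

def assign(points, centers, must_link, cannot_link, labels, w=10.0):
    """Assign each point to the cluster minimizing squared distance
    plus a penalty w for each violated pairwise constraint."""
    new_labels = list(labels)
    for i, p in enumerate(points):
        best, best_cost = None, math.inf
        for k, c in enumerate(centers):
            cost = sum((a - b) ** 2 for a, b in zip(p, c))
            # penalty for splitting a must-linked pair across clusters
            cost += w * sum(1 for (a, b) in must_link
                            if i in (a, b) and new_labels[b if a == i else a] != k)
            # penalty for putting a cannot-linked pair in the same cluster
            cost += w * sum(1 for (a, b) in cannot_link
                            if i in (a, b) and new_labels[b if a == i else a] == k)
            if cost < best_cost:
                best, best_cost = k, cost
        new_labels[i] = best
    return new_labels

# Without constraints, (4, 4) goes to the nearer center (5, 5); a strong
# must-link to (0, 0) pulls it into the first cluster instead.
print(assign([(0, 0), (4, 4)], [(0, 0), (5, 5)], [], [], [0, 1]))          # [0, 1]
print(assign([(0, 0), (4, 4)], [(0, 0), (5, 5)], [(0, 1)], [], [0, 1], w=40.0))  # [0, 0]
```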
We next investigate the task of learning surface paraphrases, i.e., paraphrases that do not require a syntactic interpretation. Since a very large corpus is needed to find enough surface variations, we start with a large but unprocessed 150GB corpus (25 billion words) obtained from Google News. We rely only on distributional similarity to learn paraphrases from this corpus, and to scale paraphrase acquisition to its size we apply only simple part-of-speech tagging and randomized algorithms. The resulting paraphrase resource contains more than 2.5 million phrases, of which 71% of the quasi-paraphrases are correct.
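Randomized algorithms make distributional similarity tractable at this scale by replacing exact comparisons of huge context sets with compact random signatures. MinHash is one standard such technique, shown here purely for illustration (the thesis's specific choice of randomized algorithm may differ; the context words below are invented):

```python
# Illustrative MinHash sketch: compare phrases by compact random signatures
# of their context-word sets instead of the full sets.
import hashlib
import random

random.seed(0)
P = 2_147_483_647                      # a large prime modulus
NUM_HASHES = 100
# random hash functions h(x) = (a*x + b) mod P
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(NUM_HASHES)]

def signature(context_words):
    """MinHash signature of a phrase's set of context words."""
    ids = [int(hashlib.md5(w.encode()).hexdigest(), 16) % P for w in context_words]
    return [min((a * x + b) % P for x in ids) for a, b in HASHES]

def est_similarity(sig1, sig2):
    """Fraction of agreeing minima estimates the Jaccard similarity."""
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

ctx_a = {"eat", "food", "dinner", "restaurant", "hungry"}
ctx_b = {"eat", "food", "dinner", "restaurant", "meal"}
# estimate of Jaccard(ctx_a, ctx_b) = 4/6, using 100 numbers per phrase
print(est_similarity(signature(ctx_a), signature(ctx_b)))
```

Each phrase is reduced to 100 integers regardless of how many contexts it appears in, so pairwise comparison cost is constant, which is what makes billion-word-scale corpora feasible.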
Having learned the surface paraphrases, we investigate their utility for the task of relation extraction. We show that these paraphrases can be used to learn surface patterns for relation extraction. The extraction patterns obtained using the paraphrases are not only more precise (more than 80% precision for both our test relations) but also have higher relative recall than a state-of-the-art baseline, and our method delivers more extraction patterns than the baseline. Applying the learned extraction patterns to the task of extracting relation instances from a test corpus, our system loses some relative recall compared to the baseline but achieves much higher precision (more than 85% for both our test relations).
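The mechanism can be sketched simply: paraphrases of a seed pattern become additional extraction patterns, so one known pattern yields instances expressed many different ways. The patterns, paraphrases, and sentences below are invented for the example and do not come from the thesis's resource.

```python
# Illustrative sketch: expanding a seed relation pattern with its paraphrases
# and matching all of them against text.
import re

seed = "<X> acquired <Y>"
paraphrases = ["<X> bought <Y>", "<X> took over <Y>"]  # would come from the resource

def to_regex(pattern):
    """Turn a surface pattern with <X>/<Y> slots into a capturing regex."""
    return re.compile(pattern.replace("<X>", r"(\w+)").replace("<Y>", r"(\w+)"))

patterns = [to_regex(p) for p in [seed] + paraphrases]
sentences = ["Google acquired YouTube", "Disney bought Pixar", "Oracle took over Sun"]

found = [(m.group(1), m.group(2))
         for s in sentences
         for p in patterns
         if (m := p.search(s))]
print(found)  # [('Google', 'YouTube'), ('Disney', 'Pixar'), ('Oracle', 'Sun')]
```

The seed pattern alone would extract only the first instance; the paraphrase-derived patterns recover the other two.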
Finally, we use paraphrases to learn patterns for domain-specific information extraction (IE). Since the paraphrases are learned from a large broad-coverage corpus, our patterns are domain-independent, making it easy to move to new domains. We empirically show that patterns learned using (broad-coverage-corpus-based) paraphrases are comparable in performance to several state-of-the-art domain-specific IE engines.
Thus, in this thesis we define quasi-paraphrases, present methods to learn them from a corpus, and show that quasi-paraphrases are useful for information extraction.
|Committee:||Hobbs, Jerry, Knight, Kevin, McLeod, Dennis, O'Leary, Daniel, Pantel, Patrick|
|School:||University of Southern California|
|School Location:||United States -- California|
|Source:||DAI-B 70/08, Dissertation Abstracts International|
|Subjects:||Linguistics, Artificial intelligence, Computer science|
|Keywords:||Information extraction, Learning, Paraphrases, Patterns, Selectional preferences|