This thesis investigates information extraction from unstructured, ungrammatical text on the Web, such as classified ads, auction listings, and forum postings. Because the data is unstructured and ungrammatical, extraction cannot use rule-based methods that rely on consistent structures within the text, nor natural language processing techniques that rely on grammar. Instead, I describe extraction using a "reference set," which I define as a collection of known entities and their attributes. A reference set can be constructed from structured sources, such as databases, or scraped from semi-structured sources, such as collections of Web pages. In some cases, as I show in this thesis, a reference set can even be constructed automatically from the unstructured, ungrammatical text itself. This thesis presents methods to exploit reference sets for extraction using both automatic techniques and machine learning techniques. The automatic technique provides a scalable and accurate approach to extraction from unstructured, ungrammatical text. The machine learning approach provides even more accurate extractions and handles ambiguous extractions, although at the cost of requiring human effort to label training data. The results demonstrate that reference-set based extraction outperforms current state-of-the-art systems that rely on structural or grammatical clues, which are not appropriate for unstructured, ungrammatical text. Even the fully automatic case, which constructs its own reference set for automatic extraction, is competitive with current state-of-the-art techniques that require labeled data. Reference-set based extraction from unstructured, ungrammatical text allows a whole category of sources to be queried, and so to be included in data integration systems that were previously limited to structured and semi-structured sources.
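As a rough illustration of the reference-set idea described in the abstract, the following minimal sketch matches a post to the reference record that shares the most tokens with it, then labels the post's tokens with the attributes of that record. It is not the thesis's actual method: the sample data, the function names (`tokenize`, `best_match`, `extract`), and the simple token-overlap score are all assumptions made for illustration.

```python
import re

# Hypothetical reference set: known entities and their attributes.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_match(post, reference_set):
    """Return the record sharing the most tokens with the post, or None.

    Token overlap is a stand-in similarity measure chosen for brevity.
    """
    post_tokens = set(tokenize(post))
    best, best_score = None, 0
    for record in reference_set:
        record_tokens = set()
        for value in record.values():
            record_tokens.update(tokenize(value))
        score = len(post_tokens & record_tokens)
        if score > best_score:
            best, best_score = record, score
    return best


def extract(post, reference_set):
    """Label post tokens with the matched record's attribute they agree with."""
    record = best_match(post, reference_set)
    if record is None:
        return {}
    post_tokens = tokenize(post)
    extracted = {}
    for attribute, value in record.items():
        value_tokens = set(tokenize(value))
        hits = [t for t in post_tokens if t in value_tokens]
        if hits:
            extracted[attribute] = " ".join(hits)
    return extracted


# An ungrammatical, unstructured post of the kind found in classified ads.
post = "93 civic 4dr clean title runs great $1200 obo"
result = extract(post, REFERENCE_SET)
print(result)
```

In this sketch the post never mentions the make, yet the matched record (`best_match(post, REFERENCE_SET)`) still carries it, which hints at why a collection of known entities can help where structural or grammatical clues are absent.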
|Advisor:||Knoblock, Craig A.|
|Committee:||Knight, Kevin; O'Leary, Daniel; Shahabi, Cyrus|
|School:||University of Southern California|
|School Location:||United States -- California|
|Source:||DAI-B 70/05, Dissertation Abstracts International|
|Keywords:||Extraction from the web, Information extraction, Information integration, Reference set construction, Ungrammatical unstructured data|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved