Dissertation/Thesis Abstract

A reference-set approach to information extraction from unstructured, ungrammatical data sources
by Michelson, Matthew, Ph.D., University of Southern California, 2009, 160; 3355406
Abstract (Summary)

This thesis investigates information extraction from unstructured, ungrammatical text on the Web, such as classified ads, auction listings, and forum postings. Because the data is unstructured and ungrammatical, extraction cannot use rule-based methods that rely on consistent structures within the text, or natural language processing techniques that rely on grammar. Instead, I describe extraction using a "reference set," which I define as a collection of known entities and their attributes. A reference set can be constructed from structured sources, such as databases, or scraped from semi-structured sources, such as collections of Web pages. In some cases, as I show in this thesis, a reference set can even be constructed automatically from the unstructured, ungrammatical text itself. This thesis presents methods to exploit reference sets for extraction using both automatic techniques and machine learning techniques. The automatic technique provides a scalable and accurate approach to extraction from unstructured, ungrammatical text. The machine learning approach provides even higher-accuracy extractions and deals with ambiguous extractions, although at the cost of requiring human effort to label training data. The results demonstrate that reference-set based extraction outperforms the current state-of-the-art systems that rely on structural or grammatical clues, which are not appropriate for unstructured, ungrammatical text. Even the fully automatic case, which constructs its own reference set for automatic extraction, is competitive with the current state-of-the-art techniques that require labeled data. Reference-set based extraction from unstructured, ungrammatical text allows a whole category of sources to be queried and therefore included in data integration systems that were previously limited to structured and semi-structured sources.
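To make the idea of a reference set concrete, the following is a minimal illustrative sketch, not the thesis's actual algorithm. It assumes a tiny hypothetical reference set of cars and uses simple token overlap to align a post with a reference record; the names `REFERENCE_SET`, `tokenize`, and `best_match`, the sample data, and the scoring rule are all assumptions made for illustration.

```python
import re

# Hypothetical reference set: known entities and their attributes.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Corolla"},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(post, reference_set):
    """Return the record whose attribute values overlap most with the post.

    Token overlap is a stand-in scoring rule chosen for brevity (an
    assumption). Returns None when no record shares any token with the post.
    """
    post_tokens = tokenize(post)
    best, best_score = None, 0
    for record in reference_set:
        record_tokens = set()
        for value in record.values():
            record_tokens |= tokenize(value)
        score = len(post_tokens & record_tokens)
        if score > best_score:
            best, best_score = record, score
    return best


# An unstructured, ungrammatical post of the kind found in classified ads.
post = "93 civic 5speed runs great obo"
print(best_match(post, REFERENCE_SET))
```

In this sketch the post never mentions "Honda", yet the matched record supplies the make along with the model, which illustrates how known entities can fill in attributes without relying on structure or grammar in the text.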

Indexing (document details)
Advisor: Knoblock, Craig A.
Committee: Knight, Kevin; O'Leary, Daniel; Shahabi, Cyrus
School: University of Southern California
Department: Computer Science
School Location: United States -- California
Source: DAI-B 70/05, Dissertation Abstracts International
Subjects: Computer science
Keywords: Extraction from the web, Information extraction, Information integration, Reference set construction, Ungrammatical unstructured data
Publication Number: 3355406
ISBN: 978-1-109-13959-4
Copyright © 2021 ProQuest LLC. All rights reserved.