Dissertation/Thesis Abstract

High-performance knowledge-based entity extraction
by Middleton, Anthony M., Ph.D., Nova Southeastern University, 2009, 317; 3352341
Abstract (Summary)

Human language records most of the information and knowledge produced by organizations and individuals. The machine-based process of analyzing information in natural language form is called natural language processing (NLP). Information extraction (IE) is the process of analyzing machine-readable text and identifying and collecting information about specified types of entities, events, and relationships.

Named entity extraction is an area of IE concerned specifically with recognizing and classifying proper names for persons, organizations, and locations in natural language text. Extant approaches to the design and implementation of named entity extraction systems include: (a) knowledge-engineering approaches, which rely on domain experts to hand-craft NLP rules that recognize and classify named entities; (b) supervised machine-learning approaches, in which a previously tagged corpus of named entities is used to train algorithms that incorporate statistical and probabilistic methods for NLP; and (c) hybrid approaches, which incorporate aspects of both (a) and (b).

Performance of IE systems is evaluated using the metrics of precision and recall, which measure, respectively, the accuracy and completeness of the IE task. Previous research has shown that a large knowledge base of known entities has the potential to improve both the precision and the recall of entity extraction. Although existing methods typically incorporate dictionary-based features, these dictionaries have been limited in size and scope.
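As a brief illustration of the two metrics, the following minimal sketch computes precision and recall for a set of extracted entities against a reference set. The function name, the (text, type) pair representation, and the sample entities are assumptions made for this example; they are not taken from the dissertation.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (text, type) pairs."""
    predicted, gold = set(predicted), set(gold)
    # An extracted entity counts as correct only if text and type both match.
    true_positives = len(predicted & gold)
    # Precision: share of extracted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented sample data: three reference entities, two system outputs.
gold = {("Acme Corporation", "ORG"), ("Paris", "LOC"), ("Jane Doe", "PER")}
predicted = {("Acme Corporation", "ORG"), ("Paris", "ORG")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of the two extracted entities is correct, so precision is 1/2, and one of the three reference entities was found, so recall is 1/3.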

The problem addressed by this research was the design, implementation, and evaluation of a new high-performance, knowledge-based hybrid processing approach and associated algorithms for named entity extraction. The approach combines rule-based natural language parsing with memory-based machine learning classification, both facilitated by an extensive knowledge base of existing named entities. Measured on a standard test corpus, the hybrid approach implemented in this research improved precision and recall relative to existing methods and approached human-level capability. The system design incorporated a parallel processing architecture capable of managing a large knowledge base and providing high throughput potential for processing large collections of natural language text documents.
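The abstract does not give the rules, features, or algorithms actually used, so the following is only a toy sketch of how the three ingredients named above could fit together: a rule that proposes candidate names, a knowledge base lookup, and a memory-based (nearest-neighbor) classifier as a fallback. Every name, rule, feature, and data item here is hypothetical, and the parallel processing architecture is not modeled.

```python
import re

# Hypothetical toy knowledge base of known entities (name -> type).
KNOWLEDGE_BASE = {"Acme Corporation": "ORG", "Paris": "LOC", "Jane Doe": "PER"}

# Hypothetical stored instances for the memory-based classifier:
# (previous word, last token of candidate, token count) -> type.
MEMORY = [
    (("at", "Inc", 2), "ORG"),
    (("in", "London", 1), "LOC"),
    (("Dr", "Smith", 2), "PER"),
    (("by", "Corporation", 2), "ORG"),
]

# Toy rule: a candidate is a run of capitalized words separated by spaces.
CANDIDATE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def features(text, match):
    """Build the feature tuple for one candidate match."""
    before = text[:match.start()].split()
    prev_word = before[-1].strip(".,") if before else "<s>"
    tokens = match.group().split()
    return (prev_word, tokens[-1], len(tokens))


def classify_from_memory(feats):
    """1-nearest-neighbor: pick the stored instance sharing the most features."""
    best = max(MEMORY, key=lambda item: sum(a == b for a, b in zip(item[0], feats)))
    return best[1]


def extract_entities(text):
    """Return (name, type) pairs, preferring the knowledge base over memory."""
    results = []
    for match in CANDIDATE.finditer(text):
        name = match.group()
        label = KNOWLEDGE_BASE.get(name) or classify_from_memory(features(text, match))
        results.append((name, label))
    return results


print(extract_entities("Jane Doe visited Paris with Alan Smith."))
```

In this example "Jane Doe" and "Paris" are resolved directly from the knowledge base, while "Alan Smith" is unknown and is labeled by its closest stored instance. A larger knowledge base would shift more decisions to the direct lookup path, which is the intuition behind the claim about knowledge base size.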

Indexing (document details)
Advisor: Mukherjee, Sumitra
Committee: Seagull, Amon; Simco, Greg E.
School: Nova Southeastern University
Department: Computer Information Systems (MCIS, DCIS)
School Location: United States -- Florida
Source: DAI-B 70/04, Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Artificial intelligence, Computer science
Keywords: Computational linguistics, Entity extraction, Information extraction, Machine learning, Memory-based learning, Natural language processing
Publication Number: 3352341
ISBN: 9781109090802
Copyright © 2019 ProQuest LLC. All rights reserved.