Human language records most of the information and knowledge produced by organizations and individuals. The machine-based process of analyzing information in natural language form is called natural language processing (NLP). Information extraction (IE) is the process of analyzing machine-readable text and identifying and collecting information about specified types of entities, events, and relationships.
Named entity extraction is an area of IE concerned specifically with recognizing and classifying proper names for persons, organizations, and locations from natural language. Extant approaches to the design and implementation named entity extraction systems include: (a) knowledge-engineering approaches which utilize domain experts to hand-craft NLP rules to recognize and classify named entities; (b) supervised machine-learning approaches in which a previously tagged corpus of named entities is used to train algorithms which incorporate statistical and probabilistic methods for NLP; or (c) hybrid approaches which incorporate aspects of both methods described in (a) and (b).
Performance for IE systems is evaluated using the metrics of precision and recall which measure the accuracy and completeness of the IE task. Previous research has shown that utilizing a large knowledge base of known entities has the potential to improve overall entity extraction precision and recall performance. Although existing methods typically incorporate dictionary-based features, these dictionaries have been limited in size and scope.
The problem addressed by this research was the design, implementation, and evaluation of a new high-performance knowledge-based hybrid processing approach and associated algorithms for named entity extraction, combining rule-based natural language parsing and memory-based machine learning classification facilitated by an extensive knowledge base of existing named entities. The hybrid approach implemented by this research resulted in improved precision and recall performance approaching human-level capability compared to existing methods measured using a standard test corpus. The system design incorporated a parallel processing system architecture with capabilities for managing a large knowledge base and providing high throughput potential for processing large collections of natural language text documents.
|Commitee:||Seagull, Amon, Simco, Greg E.|
|School:||Nova Southeastern University|
|Department:||Computer Information Systems (MCIS, DCIS)|
|School Location:||United States -- Florida|
|Source:||DAI-B 70/04, Dissertation Abstracts International|
|Subjects:||Artificial intelligence, Computer science|
|Keywords:||Computational linguistics, Entity extraction, Information extraction, Machine learning, Memory-based learning, Natural language processing|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be