Dissertation/Thesis Abstract

Graph-based approaches to resolve entity ambiguity
by Pershina, Maria, Ph.D., New York University, 2016, 94; 10139524
Abstract (Summary)

Information extraction is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. One of its challenges is resolving ambiguity between entities, either in a knowledge base or in text documents. The problem has many variations and is known under different names, such as coreference resolution, entity disambiguation, entity linking, and entity matching. Coreference resolution decides whether two expressions refer to the same entity; entity disambiguation determines how to map an entity mention to an appropriate entity in a knowledge base (KB); entity linking focuses on inferring that two entity mentions in one or more documents refer to the same real-world entity even if they do not appear in a KB; and entity matching (also called record deduplication, entity resolution, or reference reconciliation) merges records from databases if they refer to the same object.

Resolving ambiguity and finding proper matches between entities is an important step for many downstream applications, such as data integration, question answering, and relation extraction. The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains, posing a scalability challenge for information extraction systems. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and to answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge.

This dissertation studies various aspects of, and settings for, resolving ambiguity between entities. A new scalable, domain-independent, graph-based approach utilizing Personalized PageRank is developed for entity matching across large-scale knowledge bases and evaluated on datasets of 110 million and 203 million entities. A new model for entity disambiguation between a document and a knowledge base, which utilizes a document graph and effectively filters out noise, is proposed; corresponding datasets are released. A competitive result of 91.7% in micro-accuracy on the benchmark AIDA dataset is achieved, outperforming the most recent state-of-the-art models. A new technique based on a paraphrase detection model is proposed to recognize name variations for an entity in a document. Corresponding training and test datasets are made publicly available. A new approach integrating the graph-based entity disambiguation model and this technique is presented for an entity linking task and is evaluated on a dataset for the Text Analysis Conference Entity Discovery and Linking task.
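
For readers unfamiliar with Personalized PageRank, the following is a minimal sketch of the general technique on a toy graph. It is not the dissertation's implementation: the function name, the node labels, the damping value of 0.85, and the handling of dangling nodes are all assumptions made for illustration. It shows a random walk that restarts at a single source node, so that nodes closely connected to the source receive higher scores.

```python
def personalized_pagerank(graph, source, alpha=0.85, iterations=50):
    """Power iteration for Personalized PageRank restarting at `source`.

    graph: dict mapping each node to a list of its out-neighbors
           (every neighbor must also appear as a key).
    alpha: probability of following an edge; 1 - alpha is the restart
           probability (0.85 is an assumed, conventional value).
    """
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0  # all probability mass starts at the source

    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, neighbors in graph.items():
            if neighbors:
                # Split this node's mass evenly among its out-neighbors.
                share = alpha * scores[node] / len(neighbors)
                for neighbor in neighbors:
                    updated[neighbor] += share
            else:
                # Dangling node: send its mass back to the source
                # (one common convention, assumed here).
                updated[source] += alpha * scores[node]
        # Restart step: the walker jumps back to the source.
        updated[source] += 1.0 - alpha
        scores = updated
    return scores


# Hypothetical toy graph: A links to B and C, B links to A and D, etc.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B"],
}
scores = personalized_pagerank(toy_graph, source="A")
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, B is reachable from the source both directly and through D, so it ends up ranked above C; how such scores would be turned into entity matches is specific to the dissertation and is not sketched here.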

Indexing (document details)
Advisor: Grishman, Ralph
Committee: Davis, Ernis; Ji, Heng; Meyers, Adam; Sekine, Satoshi
School: New York University
Department: Computer Science
School Location: United States -- New York
Source: DAI-B 78/01(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Computer science
Keywords: Deduplication, Entity disambiguation, Entity linking, Entity matching, Reconciliation, Record linking
Publication Number: 10139524
ISBN: 978-1-339-95009-9
Copyright © 2019 ProQuest LLC. All rights reserved.