
Dissertation/Thesis Abstract

Assembling Information from Big Corpora by Focusing Machine Reading
by Noriega Atala, Enrique, Ph.D., The University of Arizona, 2020, 118; 28314511
Abstract (Summary)

We propose a methodology for teaching an automated agent to search for multi-hop connections in large corpora by selectively allocating and deploying machine reading resources. The elements of a multi-hop connection are often located in different documents that are not known ahead of time, which makes it hard for a naive algorithm to exhaustively process a corpus of substantial size, e.g., the English Wikipedia or PubMed Central.

We formulate the elements of a novel search framework, focused reading (FR), as a Markov Decision Process whose state representation comprises domain-agnostic features related to the current state of the search and the dynamics of the search process. We employ reinforcement learning (RL) to find a policy to search for multi-hop connections in the biomedical domain. Our evaluation of the framework finds that the learned policy is more efficient at retrieving multi-hop paths than a strong, deterministic baseline algorithm.
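
To make the shape of such a search loop concrete, the following is a minimal sketch and not the dissertation's implementation. Everything in it is an illustrative assumption: the toy `CORPUS`, the `"and"`/`"or"` query modes, the state features (`step`, `docs_read`, `nodes`), and the hand-written `policy`, which stands in for a policy that would be learned with RL. It shows only the general cycle of observing a state, choosing a retrieval action, reading the retrieved documents, and checking for a multi-hop path.

```python
from collections import deque

# Hypothetical toy corpus: "reading" a document yields (entity, entity) relations.
CORPUS = {
    "d1": [("A", "B")],
    "d2": [("B", "C")],
    "d3": [("C", "D")],
    "d4": [("X", "Y")],
}

def retrieve(entities, mode):
    """Return ids of documents mentioning any ('or') or all ('and') of the entities."""
    test = any if mode == "or" else all
    hits = []
    for doc_id, relations in CORPUS.items():
        mentioned = {e for pair in relations for e in pair}
        if test(e in mentioned for e in entities):
            hits.append(doc_id)
    return hits

def find_path(graph, source, target):
    """Breadth-first search over the undirected graph assembled so far."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

def policy(state):
    """Hand-written stand-in for a learned policy (assumption, not from the source):
    try a narrow conjunction query first, then fall back to broad disjunctions."""
    return "and" if state["step"] == 0 else "or"

def focused_search(source, target, max_steps=10):
    """Iteratively retrieve and read documents until a path links source and target."""
    graph, read = {}, set()
    frontier = {source, target}
    for step in range(max_steps):
        # Domain-agnostic state features (hypothetical choice of features).
        state = {"step": step, "docs_read": len(read), "nodes": len(graph)}
        mode = policy(state)
        entities = {source, target} if mode == "and" else frontier
        for doc_id in retrieve(entities, mode):
            if doc_id in read:
                continue
            read.add(doc_id)
            for a, b in CORPUS[doc_id]:
                graph.setdefault(a, set()).add(b)
                graph.setdefault(b, set()).add(a)
        path = find_path(graph, source, target)
        if path is not None:
            return path, sorted(read)
        frontier = set(graph) | {source, target}
    return None, sorted(read)

path, docs = focused_search("A", "D")
print(path, docs)
```

In this toy run, the search connects `A` to `D` without ever reading `d4`, which is the kind of saving in documents processed that a selective policy aims for; an RL agent would replace `policy` with a function learned from rewards.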

We introduce extensions to the FR framework to evaluate it in an open domain. Besides the domain-agnostic state representation features, we introduce a set of features that capture information about the topic distribution of the underlying corpus, as well as features that capture the distributional similarity of the entities extracted with machine reading tools. We use RL to find a policy that recovers more multi-hop paths while processing fewer documents than multiple heuristic baselines in an open-domain corpus. We perform an extensive analysis to understand the performance and limits of the method. Semantic drift is found to be a prevalent issue that affects the search outcomes and the coherence of the paths found by FR.

We present first steps towards reducing semantic drift in the biomedical domain by proposing a supervised learning method to assign biological container context to biochemical interactions detected with information extraction. Examples of biological container context include species, organ, or tissue type. We propose a set of features based on frequency, syntactic properties, and other linguistic properties relative to an expression of a container context and to expressions of biochemical interactions. We experiment with a battery of classification algorithms and compare favorably to a deterministic, location-based baseline. We leave the integration of this methodology into an FR implementation to future work.

Indexing (document details)
Advisor: Morrison, Clayton
Committee: Surdeanu, Mihai; Jansen, Peter A.
School: The University of Arizona
Department: Information Resources & Library Science
School Location: United States -- Arizona
Source: DAI-A 82/7(E), Dissertation Abstracts International
Subjects: Information science, Artificial intelligence, Computer science
Keywords: Information extraction, Information retrieval, Multi-hop relations, Reinforcement learning
Publication Number: 28314511
ISBN: 9798569903023
Copyright © 2021 ProQuest LLC. All rights reserved.