We propose a methodology to teach an automated agent how to search for multi-hop connections in large corpora by selectively allocating and deploying machine reading resources. The elements of multi-hop connections are often located in different documents that are not known ahead of time, which makes it hard for a naive algorithm to exhaustively process a corpus of reasonable size, e.g., the English Wikipedia or PubMed Central.
We formulate the elements of a novel search framework, focused reading (FR), as a Markov Decision Process whose state representation comprises domain-agnostic features related to the current state of the search and the dynamics of the search process. We employ reinforcement learning (RL) to find a policy to search for multi-hop connections in the biomedical domain. Our evaluation of the framework finds that the learned policy is more efficient at retrieving multi-hop paths than a strong, deterministic baseline algorithm.
We introduce extensions to the FR framework to evaluate it in an open domain. Besides the domain-agnostic state representation features, we introduce a set of features that capture information about the topic distribution of the underlying corpus, as well as features that capture the distributional similarity of the entities extracted with machine reading tools. We use RL to find a policy that recovers more multi-hop paths while processing fewer documents than multiple heuristic baselines in an open-domain corpus. We perform an extensive analysis to understand the performance and limits of the method. Semantic drift is found to be a prevalent issue that affects the search outcomes and the coherence of the paths found by FR.
We present first steps towards reducing semantic drift in the biomedical domain by proposing a supervised learning method to assign biological container context to biochemical interactions detected with information extraction. Examples of biological container context include species, organ, or tissue type. We propose a set of features based on frequency, syntactic properties, and other linguistic properties relative to an expression of a container context and to expressions of biochemical interactions. We experiment with a battery of classification algorithms and compare favorably to a deterministic, location-based baseline. We leave to future work the integration of this methodology into an FR implementation.
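As an illustration only, the Markov Decision Process framing of focused reading mentioned in the abstract might be sketched as follows. The abstract does not specify the action set, reward, or exact features, so everything here is an assumption: the toy corpus, the `FocusedReadingEnv` class, the per-document read cost, and the `expand_unqueried` heuristic, which stands in for a learned RL policy.

```python
from collections import deque

# Hypothetical toy corpus: each "document" asserts links between entities.
CORPUS = {
    "doc1": [("A", "X")],
    "doc2": [("X", "Y")],
    "doc3": [("Y", "B")],
    "doc4": [("A", "Q")],
    "doc5": [("Q", "R")],
}


def find_path(edges, source, target):
    """Breadth-first search over the undirected graph extracted so far."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    queue, parents = deque([source]), {source: None}
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


class FocusedReadingEnv:
    """Assumed MDP: an action queries one entity and reads matching documents."""

    def __init__(self, corpus, source, target, read_cost=0.1):
        self.corpus, self.source, self.target = corpus, source, target
        self.read_cost = read_cost  # assumed penalty per document read
        self.edges, self.read, self.queried = set(), set(), set()

    def state(self):
        # Assumed domain-agnostic features: documents read, links extracted.
        return (len(self.read), len(self.edges))

    def actions(self):
        known = {self.source, self.target}
        for u, v in self.edges:
            known.update((u, v))
        return sorted(known)

    def step(self, entity):
        self.queried.add(entity)
        # "Machine reading": process unread documents mentioning the entity.
        new_docs = [d for d, links in self.corpus.items()
                    if d not in self.read and any(entity in l for l in links)]
        for doc in new_docs:
            self.read.add(doc)
            self.edges.update(self.corpus[doc])
        path = find_path(self.edges, self.source, self.target)
        reward = (1.0 if path else 0.0) - self.read_cost * len(new_docs)
        return self.state(), reward, path


def expand_unqueried(env):
    """Heuristic stand-in for a learned policy: first entity not yet queried."""
    candidates = [e for e in env.actions() if e not in env.queried]
    return candidates[0] if candidates else None


def run_episode(env, policy, max_steps=20):
    total = 0.0
    for _ in range(max_steps):
        entity = policy(env)
        if entity is None:
            break
        _, reward, path = env.step(entity)
        total += reward
        if path:
            return path, total
    return None, total
```

Calling `run_episode(FocusedReadingEnv(CORPUS, "A", "B"), expand_unqueried)` searches for a path between the two hypothetical endpoints. A learned policy would replace `expand_unqueried` and aim to reach the same path while reading fewer documents.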
|Committee:||Surdeanu, Mihai, Jansen, Peter A.|
|School:||The University of Arizona|
|Department:||Information Resources & Library Science|
|School Location:||United States -- Arizona|
|Source:||DAI-A 82/7(E), Dissertation Abstracts International|
|Subjects:||Information science, Artificial intelligence, Computer science|
|Keywords:||Information extraction, Information retrieval, Multi-hop relations, Reinforcement learning|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved