Life science researchers need to find descriptions of genes quickly in order to understand and interpret the results of their experiments. For this reason, life scientists constantly search the biomedical literature for articles describing genes they might not be familiar with. Learning facts about genes by reading these documents can be an arduous and time-consuming task. In addition, searching millions of documents can return many irrelevant results, because gene names can be highly ambiguous.
In this dissertation, we seek to help biologists quickly find information about genes. We start by finding article abstracts that mention a gene's names and synonyms, and automatically filtering out irrelevant abstracts that are introduced by gene name ambiguities or that mention the gene only in passing. We then mine informative terms about the gene by identifying terms that have a disproportionately higher frequency when mentioned with the gene than alone. Since some of these terms are meaningful only in context, we automatically identify sentences that succinctly and clearly describe their relations to the gene. Put together, a gene's abstracts, informative terms, and descriptive sentences can serve as an overview of the gene, as well as a gateway to the literature for further exploration.
Our evaluations show that the retrieval of gene-centric abstracts is accurate and has high recall, that the terms mined from these documents are relevant to their corresponding genes, and that the sentences describing the relations between genes and their informative terms are rated highly by biologists. The system presented in this dissertation is available online and has already been integrated into a gene annotation pipeline.
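The term-mining step described above (terms that occur disproportionately more often with the gene than alone) can be illustrated with a minimal sketch. The abstract does not state which statistic the dissertation uses, so the smoothed document-frequency ratio below is an assumption chosen for illustration, and the names `tokenize`, `informative_terms`, `min_gene_docs`, and `smoothing` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase word-like tokens in one abstract."""
    return set(re.findall(r"[a-z][a-z0-9-]+", text.lower()))


def informative_terms(gene_abstracts, background_abstracts,
                      min_gene_docs=2, smoothing=1.0):
    """Rank terms by how much more often they appear in a gene's abstracts
    than in a background collection.

    Assumption: the score is a smoothed ratio of document-frequency rates;
    the dissertation's actual measure may differ.
    """
    gene_df = Counter()
    for abstract in gene_abstracts:
        gene_df.update(tokenize(abstract))

    background_df = Counter()
    for abstract in background_abstracts:
        background_df.update(tokenize(abstract))

    n_gene = len(gene_abstracts)
    n_background = len(background_abstracts)

    scores = {}
    for term, df in gene_df.items():
        if df < min_gene_docs:
            continue  # ignore terms seen too rarely to be reliable
        gene_rate = (df + smoothing) / (n_gene + 2 * smoothing)
        background_rate = (background_df[term] + smoothing) / (n_background + 2 * smoothing)
        scores[term] = gene_rate / background_rate

    # Highest ratio first: terms most specific to the gene's abstracts.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A score well above 1 suggests a term is specific to the gene's abstracts, while common words that appear everywhere score near or below 1. The smoothing constant keeps the ratio finite for terms that never occur in the background collection.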
|Advisor:||Shanker, Vijay K.|
|Committee:||Carterette, Benjamin A., McCoy, Kathleen F., Schmidt, Carl J., Wu, Cathy H.|
|School:||University of Delaware|
|Department:||Department of Computer and Information Sciences|
|School Location:||United States -- Delaware|
|Source:||DAI-B 72/12, Dissertation Abstracts International|
|Subjects:||Bioinformatics, Computer science|
|Keywords:||Biomedical text mining, Natural language processing, Text mining|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved