Information retrieval algorithms attempt to match a user's description of their information need with relevant information in a collection of documents or other data. Applications include Web search engines, filtering and recommendation systems, computer-assisted language tutors, and many others. A key challenge of retrieval algorithms is to perform effective matching when many factors, such as the user's true information need, may be highly uncertain and can only be partially observed via a small number of keywords. This dissertation develops broadly applicable algorithms for measuring and exploiting such uncertainty in retrieval algorithms to make them more effective and reliable. Our contributions include new theoretical models, statistical methods, evaluation techniques, and retrieval algorithms.
As an application, we focus on a long-studied approach to improving retrieval matching that adds related terms to a query — a process known as query expansion. Query expansion works well on average, but even state-of-the-art methods are still highly unreliable and can greatly hurt results for individual queries. We show how sensitivity information for an expansion algorithm can be obtained and used to improve its reliability without reducing overall effectiveness.
Our approach proceeds in two steps. First, treating the base expansion method as a 'black box', we gather information about how the algorithm's output — a set of expansion terms — changes with perturbations of the initial query and top-ranked documents. This step also results in a set of plausible expansion model candidates. We then introduce a novel risk framework based on convex optimization that prunes and combines these candidates to produce a much more reliable version of the original baseline expansion algorithm. Highlights of our results include: • A new algorithmic framework for estimating more precise query and document models, based on treating queries and document sets as random variables instead of single observations. • The first significant application and analysis of convex optimization methods to query expansion problems in information retrieval. • A new family of statistical similarity measures we call perturbation kernels that are efficient to compute and give context-sensitive word clustering. • The introduction of risk-reward analysis to information retrieval, including tradeoff curves, analysis, and risk measures. • A new general form of query difficulty measure that reflects clustering in the collection as well as the relation between a query and the collection.
|School:||Carnegie Mellon University|
|School Location:||United States -- Pennsylvania|
|Source:||DAI-B 74/01(E), Dissertation Abstracts International|
|Subjects:||Artificial intelligence, Computer science|
|Keywords:||Information retrieval, Query expansion|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be