Dissertation/Thesis Abstract

Robust Model Estimation Methods For Information Retrieval
by Collins-Thompson, Kevyn, Ph.D., Carnegie Mellon University, 2008, 220; 3528730
Abstract (Summary)

Information retrieval algorithms attempt to match a user's description of their information need with relevant information in a collection of documents or other data. Applications include Web search engines, filtering and recommendation systems, computer-assisted language tutors, and many others. A key challenge of retrieval algorithms is to perform effective matching when many factors, such as the user's true information need, may be highly uncertain and can only be partially observed via a small number of keywords. This dissertation develops broadly applicable algorithms for measuring and exploiting such uncertainty in retrieval algorithms to make them more effective and reliable. Our contributions include new theoretical models, statistical methods, evaluation techniques, and retrieval algorithms.

As an application, we focus on a long-studied approach to improving retrieval matching that adds related terms to a query — a process known as query expansion. Query expansion works well on average, but even state-of-the-art methods are still highly unreliable and can greatly hurt results for individual queries. We show how sensitivity information for an expansion algorithm can be obtained and used to improve its reliability without reducing overall effectiveness.

Our approach proceeds in two steps. First, treating the base expansion method as a 'black box', we gather information about how the algorithm's output — a set of expansion terms — changes with perturbations of the initial query and top-ranked documents. This step also results in a set of plausible expansion model candidates. We then introduce a novel risk framework based on convex optimization that prunes and combines these candidates to produce a much more reliable version of the original baseline expansion algorithm. Highlights of our results include: • A new algorithmic framework for estimating more precise query and document models, based on treating queries and document sets as random variables instead of single observations. • The first significant application and analysis of convex optimization methods to query expansion problems in information retrieval. • A new family of statistical similarity measures we call perturbation kernels that are efficient to compute and give context-sensitive word clustering. • The introduction of risk-reward analysis to information retrieval, including tradeoff curves, analysis, and risk measures. • A new general form of query difficulty measure that reflects clustering in the collection as well as the relation between a query and the collection.

Indexing (document details)
Advisor: Callan, Jamie
Commitee:
School: Carnegie Mellon University
School Location: United States -- Pennsylvania
Source: DAI-B 74/01(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Artificial intelligence, Computer science
Keywords: Information retrieval, Query expansion
Publication Number: 3528730
ISBN: 9781267626738
Copyright © 2019 ProQuest LLC. All rights reserved. Terms and Conditions Privacy Policy Cookie Policy
ProQuest