One way to understand the visual world is by reasoning about the objects present in it: their type, their location, their similarities, their layout, etc. Despite several successes, detailed recognition remains a challenging task for current computer vision systems. This dissertation focuses on building systems that improve on the state of the art on several fronts. On the one hand, we propose better representations of visual categories that enable more accurate reasoning about their properties. To learn such representations, we employ machine learning methods that leverage the power of big data. On the other hand, we present solutions that make current frameworks more efficient without sacrificing performance.
The first part of the dissertation focuses on improvements in efficiency. We first introduce a fast automated mechanism for selecting a diverse set of discriminative filters and show that one can efficiently learn a universal model of filter "goodness" based on properties of the filter itself. As an alternative to the expensive evaluation of filters, which is the bottleneck in many techniques, our method has the potential to dramatically alter the trade-off between the accuracy of a filter-based method and the cost of training. Second, we present a method for linear dimensionality reduction which we call composite discriminant factor analysis (CDF). CDF searches for a discriminative but compact feature subspace in which the classifiers can be trained, leading to an order-of-magnitude saving in detection time.
In the second part, we focus on the problem of person re-identification, an important component of surveillance systems. We present a deep learning architecture that simultaneously learns features and computes their corresponding similarity metric. Given a pair of images as input, our network outputs a similarity value indicating whether the two input images depict the same person. We propose new layers which capture local relationships among mid-level features, produce a high-level summary of these relationships and spatially integrate them to give a holistic representation.
In the final part, we present a semantic object selection framework that uses natural language input to perform image editing. In the general context of interactive object segmentation, many of the methods that utilize user input (such as mouse clicks and mouse strokes) require significant user intervention. In this work, we present a system with a far simpler input method: the user only needs to give the name of the desired object. For this problem we present a solution that borrows ideas from image retrieval, segmentation propagation, object localization, and convolutional neural networks.
|Advisor:||Davis, Larry S.|
|Committee:||Aloimonos, Yiannis; Duraiswami, Ramani; Jacobs, David W.; Lin, Jimmy|
|School:||University of Maryland, College Park|
|School Location:||United States -- Maryland|
|Source:||DAI-B 76/11(E), Dissertation Abstracts International|
|Keywords:||Computer vision, Deep learning, Filter selection, Object detection, Object selection, Person re-identification|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved