Humans are able to recognize objects in a scene almost effortlessly. Our visual system can easily handle ambiguous settings, like partial occlusions or large variations in viewpoint. One hypothesis that explains this ability is that we process the scene as a global instance. Using global contextual reasoning (e.g., a car sits on a road, but not on a building facade) can constrain interpretations of objects to plausible, coherent percepts. This type of reasoning has been explored in Computer Vision using weak 2D context, mostly extracted from monocular cues. In this thesis, we explore the benefits of strong 3D context extracted from multiple-view geometry. We demonstrate strong ties between geometric reasoning and object recognition, effectively bridging the gap between them to improve scene understanding.
In the first part of this thesis, we describe the basic principles of structure from motion, which provide strong and reliable geometric models that can be used for contextual scene understanding. We present a novel algorithm for camera localization that leverages search space partitioning to allow more aggressive filtering of potential correspondences. We exploit image covisibility using a coarse-to-fine, prioritized search approach that can recognize scene landmarks rapidly. This system achieves state-of-the-art results in large-scale camera localization, especially in difficult scenes with frequently repeated structures.
In the second part of this thesis, we study how to exploit these strong geometric models and localized cameras to improve recognition. We introduce an unsupervised training pipeline to generate scene-specific object detectors. These classifiers outperform the state of the art and can be used when the rough camera location is known. When precise camera pose is available, we can inject additional geometric cues into a novel re-scoring framework to further improve detection. We demonstrate the utility of background scene models for false positive pruning, akin to video-surveillance background subtraction strategies. Finally, we observe that the increasing availability of mapping data stored in Geographic Information Systems (GIS) provides strong geo-semantic information that can be used when cameras are located in world coordinates. We propose a novel contextual reasoning pipeline that uses lifted 2D GIS models to quickly retrieve precise geo-semantic priors. We use these cues to improve object detection and image semantic segmentation, providing a successful trade-off of false positives that boosts average precision over baseline detection models.
|Advisor:||Fowlkes, Charless C.|
|Committee:||Markopoulou, Athina; Ramanan, Deva|
|School:||University of California, Irvine|
|Department:||Electrical and Computer Engineering - Ph.D.|
|School Location:||United States -- California|
|Source:||DAI-B 78/04(E), Dissertation Abstracts International|
|Keywords:||Camera localization, Computer vision, Image segmentation, Object recognition, Scene understanding|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved