The ability to quickly and accurately understand pixel-level scene semantics is a key capability for robotics applications such as autonomous driving. Until now, the temporal aspect of this problem has been largely overlooked. This dissertation therefore studies the impact of temporal information on perception-related tasks and investigates whether including it is useful, specifically for semantic scene understanding.
In this thesis, we first propose a set of novel techniques for unsupervised spatio-temporal segmentation of video sequences that yield regions coherent in both space and time. We then extend our method to exploit other strong cues present in the scene, such as the depth signal or object parts, to further improve accuracy. The study of temporal data is bottlenecked by limited computing resources and by the lack of comprehensive labeled data. We tackle these issues by introducing a simple and efficient unsupervised label propagation algorithm that transfers pixel-wise semantic labels from a ground-truth frame to its adjacent frames and produces auxiliary temporal ground truth. Finally, we take a further step towards the ultimate goal of holistic scene understanding and present a deep, recurrent multi-scale network that is capable of leveraging the temporal information present in video data. We show that our model can be easily extended to the related problem of prediction, estimating the expected semantics of the scene a small number of frames into the future. We achieve promising state-of-the-art results on various datasets and show that our temporal approach is superior to the non-temporal baseline.
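As a rough illustration of what transferring pixel-wise labels from an annotated frame to a neighboring frame can look like, the sketch below warps a label map with a dense motion field. It is not the algorithm from the dissertation, which the abstract does not specify. The function name `propagate_labels`, the backward-flow convention, and the `ignore_label` value of 255 are all assumptions made for this example, and a practical version would operate on image arrays rather than nested lists.

```python
def propagate_labels(labels, flow, ignore_label=255):
    """Copy labels from an annotated frame to an adjacent frame.

    labels: 2D list (rows of ints) with the labels of the annotated frame.
    flow:   2D list of (dx, dy) pairs, assumed to point from each pixel
            (x, y) of the adjacent frame back to the annotated frame.
    Pixels whose source falls outside the image receive ignore_label
    (a hypothetical "no label" marker chosen for this sketch).
    """
    height, width = len(labels), len(labels[0])
    propagated = [[ignore_label] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Nearest-neighbor lookup keeps labels discrete (no blending).
            src_x, src_y = round(x + dx), round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                propagated[y][x] = labels[src_y][src_x]
    return propagated


# Toy example: the scene content moves one pixel to the right between frames,
# so every pixel of the adjacent frame looks one pixel to its left.
labels = [[1, 2, 3],
          [4, 5, 6]]
flow = [[(-1, 0)] * 3 for _ in range(2)]
neighbor_labels = propagate_labels(labels, flow)
```

In the toy example, the leftmost column of the adjacent frame has no source pixel and is marked with the ignore value; excluding such pixels from training is one common way to avoid learning from unreliable propagated labels.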
|Committee:||Blaschko, Matthew; Correll, Nikolaus; Heckman, Christoffer; Martin, James; Sibley, Gabe|
|School:||University of Colorado at Boulder|
|School Location:||United States -- Colorado|
|Source:||DAI-B 78/10(E), Dissertation Abstracts International|
|Subjects:||Robotics, Artificial intelligence, Computer science|
|Keywords:||3d superpixels, Efficient spectral clustering, Semantic scene understanding, Spatio-temporal segmentation, Temporal object discovery, Video understanding|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved