One of the long-standing problems in artificial intelligence is the development of intelligent agents with complete visual understanding. Understanding entails recognition of scene attributes such as actors, objects and actions, as well as reasoning about the common semantic structure that combines these attributes into a coherent description. While significant milestones have been achieved in the field of computer vision, the majority of the work has concentrated on supervised visual recognition, where complex visual representations are learned and a few discrete categories or labels are assigned to these representations. This implies a closed world where the underlying assumption is that all environments contain the same objects and events, which are in one-to-one correspondence with the ground evidence in the image. Hence, the learned knowledge is limited to the annotated training set. An open world, on the other hand, makes no assumption about the distribution of semantics and requires generalization beyond the training annotations. Increasingly complex models require massive amounts of training data and offer little to no explainability due to the lack of transparency in the decision-making process. The ability of artificial intelligence systems to offer explanations for their decisions is central to building user confidence and structuring smart human-machine interactions.
In this dissertation, we develop an inherently explainable approach for generating rich interpretations of visual scenes. We move towards open-world, open-domain visual understanding by decoupling the ideas of recognition and reasoning. We integrate common sense knowledge from large knowledge bases such as ConceptNet and the representation learning capabilities of deep learning approaches in a pattern theory formalism to interpret a complex visual scene. To be specific, we first define and develop the idea of contextualization to model and establish complex semantic relationships among concepts grounded in visual data. The resulting semantic structures, called interpretations, allow us to represent the visual scene in an intermediate representation that can then be used as the source of knowledge for various modes of expression such as labels, captions and even question answering. Second, we explore the inherent explainability of such visual interpretations and define key components for extending the notion of explainability to intelligent agents for visual recognition. Finally, we describe a self-supervised model for segmenting untrimmed videos into their constituent events. We show that this approach can segment videos without any supervision, whether implicit or explicit.
We argue that, combined, these approaches offer an elegant path to inherently explainable, open-domain visual understanding while removing the need for human supervision in the form of labels and/or captions. We show that the proposed approach can advance state-of-the-art results on complex benchmarks, handling data imbalance, complex semantics, and complex visual scenes without the need for vast amounts of domain-specific training data. Extensive experiments on several publicly available datasets show the efficacy of the proposed approaches. We show that the proposed approaches outperform weakly-supervised and unsupervised baselines by up to 24% and achieve competitive segmentation results compared to fully supervised baselines. The self-supervised approach for video segmentation complements this top-down inference with efficient bottom-up processing, resulting in an elegant formalism for open-domain visual understanding.
|Advisor:||Sarkar, Sudeep; Srivastava, Anuj|
|Committee:||Dubey, Rajiv; Sun, Yu; Malmberg, Kenneth; Licato, John|
|School:||University of South Florida|
|Department:||Computer Science and Engineering|
|School Location:||United States -- Florida|
|Source:||DAI-B 81/2(E), Dissertation Abstracts International|
|Keywords:||Activity interpretation, Common sense reasoning, Temporal segmentation, Video understanding|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved