Recent years have seen significant advances in artificial intelligence (AI). Yet we are still far from intelligent agents that can visually perceive their surroundings, reason, and interact with humans in natural language, and thereby become an integral part of our lives.
As a step towards this grand goal, this thesis proposes Visual Dialog, a task that requires an AI agent to hold a meaningful dialog with humans, in natural conversational language, about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent must ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from any specific downstream task to serve as a general test of machine intelligence, yet grounded enough in vision to allow objective evaluation of individual responses. To benchmark progress, we collect VisDial, a large dataset of human dialogs on real images.
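For concreteness, the task interface can be pictured as below. This is a minimal Python sketch under simplifying assumptions; the names (DialogTurn, VisualDialogInstance, agent.respond) are hypothetical and do not reflect the released VisDial API.

```python
# Illustrative sketch of the Visual Dialog task interface.
# All names here are hypothetical, chosen only to make the task concrete.
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class DialogTurn:
    question: str
    answer: str


@dataclass
class VisualDialogInstance:
    image: np.ndarray                 # the grounding image
    caption: str                      # caption describing the image
    history: List[DialogTurn] = field(default_factory=list)


def answer(agent, instance: VisualDialogInstance, question: str) -> str:
    """The agent must ground `question` in `instance.image`, resolve
    references against `instance.history`, and produce a free-form answer."""
    return agent.respond(instance.image, instance.caption,
                         instance.history, question)
```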
To tackle several challenges in Visual Dialog, we build CorefNMN, a neural module network that reasons explicitly about its inputs by performing visual coreference resolution. To demonstrate the effectiveness of such models, we test them on the VisDial dataset and on CLEVR-Dialog, a large diagnostic dataset that we synthetically generate with fully annotated dialog states. By breaking down model performance by history dependency, coreference distance, and related factors, we show that our models quantitatively outperform other approaches and are qualitatively more interpretable, grounded, and consistent, all of which are desirable properties for an AI system.
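The core idea behind visual coreference resolution in such module networks can be sketched as follows: cache each mentioned entity's visual grounding (an attention map over the image), and when a later question refers back to it ("it", "the dog"), softly retrieve the stored grounding rather than re-grounding from scratch. This is a toy sketch, not CorefNMN itself; the class names and the dot-product retrieval rule are simplifying assumptions.

```python
# Toy sketch of a reference pool for visual coreference resolution.
# Names and the retrieval rule are illustrative assumptions.
import torch


class ReferencePool:
    def __init__(self):
        self.keys = []        # text embeddings of entity mentions
        self.groundings = []  # corresponding image attention maps

    def add(self, key_emb: torch.Tensor, attention: torch.Tensor):
        """Store a newly grounded entity, e.g. key = embedding of 'a dog'."""
        self.keys.append(key_emb)
        self.groundings.append(attention)

    def refer(self, query_emb: torch.Tensor) -> torch.Tensor:
        """Resolve a coreference: softly retrieve the stored grounding
        whose key best matches the query embedding."""
        scores = torch.stack([k @ query_emb for k in self.keys])
        weights = torch.softmax(scores, dim=0)
        return sum(w * g for w, g in zip(weights, self.groundings))
```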
We then apply Visual Dialog to visually grounded, goal-driven dialog that requires no additional supervision. Goal-driven dialog agents specialize in a downstream task (or goal), which makes them deployable and interesting to study. We also train such agents from scratch, solely to maximize their task performance (goal rewards), and study the language that emerges. Our findings show that while most agent-invented languages are effective (i.e., achieve near-perfect task rewards), they are decidedly not interpretable or compositional. All our datasets and code are publicly available to encourage future research.
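The reward-driven training setup can be illustrated with a toy one-turn reference game, trained with REINFORCE: a speaker sees an object and utters a symbol, a listener decodes the symbol into a guess, and both are rewarded only when the guess is correct. This is a minimal sketch of the general technique, not the thesis's actual models or tasks.

```python
# Toy emergent-communication sketch: two agents trained from scratch
# purely on goal reward (REINFORCE). Illustrative only.
import torch
import torch.nn as nn

N_OBJECTS, VOCAB = 8, 8

speaker = nn.Linear(N_OBJECTS, VOCAB)      # object -> logits over symbols
listener = nn.Embedding(VOCAB, N_OBJECTS)  # symbol -> logits over objects
opt = torch.optim.Adam([*speaker.parameters(), *listener.parameters()], lr=0.05)

for step in range(2000):
    obj = torch.randint(N_OBJECTS, (32,))            # batch of target objects
    one_hot = nn.functional.one_hot(obj, N_OBJECTS).float()

    sym_dist = torch.distributions.Categorical(logits=speaker(one_hot))
    sym = sym_dist.sample()                          # speaker "utters" a symbol

    guess_dist = torch.distributions.Categorical(logits=listener(sym))
    guess = guess_dist.sample()                      # listener decodes it

    reward = (guess == obj).float()                  # goal reward: 1 if correct
    # REINFORCE with a mean-reward baseline: reinforce actions that led to success.
    loss = -((reward - reward.mean()) *
             (sym_dist.log_prob(sym) + guess_dist.log_prob(guess))).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that even when such agents reach near-perfect reward, the learned symbol-object mapping is an arbitrary permutation: effective, but neither interpretable nor compositional, which mirrors the finding stated above.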
Advisor: Moura, José M. F.
Committee: Batra, Dhruv; Bigham, Jeffrey P.; Parikh, Devi; Sankaranarayanan, Aswin
School: Carnegie Mellon University
Department: Electrical and Computer Engineering
School Location: United States -- Pennsylvania
Source: DAI-B 80/11(E), Dissertation Abstracts International
Subjects: Artificial intelligence; Computer science
Keywords: Artificial intelligence; Computer vision; Image understanding; Natural language dialog; Visual dialog