Dissertation/Thesis Abstract

Visual Dialog: Towards Communicative Visual Agents
by Kottur, Satwik, Ph.D., Carnegie Mellon University, 2019, 161; 13882985
Abstract (Summary)

Recent years have seen significant advancements in artificial intelligence (AI). Still, we are far from intelligent agents that can visually perceive their surroundings, reason, and interact with humans in natural language, thereby becoming an integral part of our lives.

As a step towards such a grand goal, this thesis proposes Visual Dialog, a task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from any specific downstream task to serve as a general test of machine intelligence, while being grounded enough in vision to allow objective evaluation of responses. To benchmark progress, we collect VisDial, a large dataset of human dialogs on real images.

To tackle several challenges in Visual Dialog, we build neural network-based models, called CorefNMN, that reason about the inputs explicitly by performing visual coreference resolution. To demonstrate the effectiveness of such models, we test them on the VisDial dataset and on a large diagnostic dataset, CLEVR-Dialog, which we synthetically generate with fully annotated dialog states. By breaking down the performance of these models according to history dependency, coreference distance, etc., we show that our models quantitatively outperform other approaches and are qualitatively more interpretable, grounded, and consistent, all of which are desirable for an AI system.

We then apply Visual Dialog to visually grounded, goal-driven dialog without the need for additional supervision. Goal-driven dialog agents specialize in a downstream task (or goal), making them deployable and interesting to study. We also train these agents from scratch solely to maximize their performance (goal rewards) and study the emergent language. Our findings show that while most agent-invented languages are effective (i.e., achieve near-perfect task rewards), they are decidedly not interpretable or compositional. All our datasets and code are publicly available to encourage future research.

Indexing (document details)
Advisor: Moura, José M. F.
Committee: Batra, Dhruv; Bigham, Jeffrey P.; Parikh, Devi; Sankaranarayanan, Aswin
School: Carnegie Mellon University
Department: Electrical and Computer Engineering
School Location: United States -- Pennsylvania
Source: DAI-B 80/11(E), Dissertation Abstracts International
Subjects: Artificial intelligence, Computer science
Keywords: Artificial intelligence, Computer vision, Image understanding, Natural language dialog, Visual dialog
Publication Number: 13882985
ISBN: 978-1-392-18681-7
Copyright © 2019 ProQuest LLC. All rights reserved.