Dissertation/Thesis Abstract

Mining Open Source Software Ecosystems and Understanding Code as Human Communication through Statistical Language Models
by Casalnuovo, Casey Richard, Ph.D., University of California, Davis, 2020, 252 pages; publication number 27999935
Abstract (Summary)

Online repository hosts such as GitHub provide a wealth of information that can be used to inform developers about whom and what to follow, to indicate project quality and health, and to show what constitutes good and normative coding practices. This environment allows developers to decide what contributions to make based on social and technical factors. In particular, communication is critical to producing successful software. While this includes direct communication such as issue and bug discussions, emails, and chat, it also includes the implicit signals generated by the software process. One form of human communication that is often overlooked in this process is the source code itself.

Code is a form of human communication. Developers do not write code only for the machine to run; other contributors must also read and understand the code to tackle the difficult and ongoing task of maintaining it. Thus, when writing code, developers must correctly implement the program’s functionality while also making it easy to understand. In this context, treating programming as a form of language (albeit one constrained by algorithmic considerations), we can reevaluate the recent boom in adapting statistical language models from natural language to source code. Though these models typically find code much more predictable than natural language, the reasons behind this difference are not yet well understood, and finding the best way to adapt these models to this new context remains an ongoing challenge.

This dissertation first looks at the wider context of mining GitHub to extract social and technical information and at how this information influences developers. It then focuses on validating and understanding what statistical language models tell us about code as a form of human communication. Where does the predictability of source code come from: the inherent syntax of the language, or the way developers choose to write it? Are the metrics used to train language models, such as surprisal, which relates to cognitive load in natural language, still valid in a source code context? Answers to these questions reveal that some of code’s predictability comes from human choices, and that surprisal relates to both human preferences and comprehension time in code. By improving our understanding of how language models work on code and how well their assumptions fit human cognitive processes, we are better positioned to create tools that use them effectively. Finally, comparing these models in both natural language and code provides a new window into understanding human communication under varying constraints.
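To make the surprisal metric concrete, the following is an illustrative sketch, not drawn from the dissertation itself: it computes per-token surprisal, -log2 P(token | context), under a toy add-alpha-smoothed bigram model. The function names and the toy token corpus are hypothetical; actual language models for code are much richer, but the quantity being measured is the same.

    # Illustrative sketch: token surprisal under a simple bigram language model,
    # where surprisal(t_i) = -log2 P(t_i | t_{i-1}).
    # Lower average surprisal means the token stream is more predictable.
    import math
    from collections import Counter, defaultdict

    def train_bigram(token_sequences, alpha=0.1):
        """Count bigrams over a toy corpus and return an add-alpha-smoothed probability function."""
        unigrams, bigrams = Counter(), defaultdict(Counter)
        for tokens in token_sequences:
            padded = ["<s>"] + tokens
            unigrams.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                bigrams[prev][cur] += 1
        vocab = set(unigrams)
        def prob(prev, cur):
            return (bigrams[prev][cur] + alpha) / (sum(bigrams[prev].values()) + alpha * len(vocab))
        return prob

    def surprisals(tokens, prob):
        """Per-token surprisal in bits for a new token sequence."""
        padded = ["<s>"] + tokens
        return [-math.log2(prob(prev, cur)) for prev, cur in zip(padded, padded[1:])]

    # Toy usage: code-like token streams are highly repetitive, so most tokens
    # receive low surprisal, while the unseen identifier "k" receives high surprisal.
    corpus = [["for", "i", "in", "range", "(", "n", ")", ":"],
              ["for", "j", "in", "range", "(", "m", ")", ":"]]
    prob = train_bigram(corpus)
    print(surprisals(["for", "k", "in", "range", "(", "n", ")", ":"], prob))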

Indexing (document details)
Advisor: Devanbu, Premkumar
Committee: Filkov, Vladimir; Rubio-González, Cindy
School: University of California, Davis
Department: Computer Science
School Location: United States -- California
Source: DAI-A 82/3(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Computer science, Linguistics, Cognitive psychology
Keywords: Mining, Open source software ecosystems, Understanding code, Human communication, Statistical language models
Publication Number: 27999935
ISBN: 9798672184180
Copyright © 2020 ProQuest LLC. All rights reserved.