Online repository hosts such as GitHub provide a wealth of information that can inform developers about whom and what to follow, indicate project quality and health, and teach what constitutes good and normative coding practices. This environment allows developers to decide what contributions to make based on both social and technical factors. In particular, communication is critical to producing successful software. While this includes direct communication such as issue and bug discussions, emails, and chat, it also includes the implicit signals generated by the software process. However, one form of human communication that is often overlooked in this process is the source code itself.
Code is a form of human communication. Developers do not write code only for the machine to run; other contributors must also read and understand the code to tackle the difficult and ever-ongoing task of maintaining it. Thus, when writing code, developers must correctly implement the program's functionality while also making the code easy to understand. In this context, treating programming as a form of language (albeit one constrained by algorithmic considerations), we can reevaluate the recent boom in adapting statistical language models from natural language to source code. Though these models typically find code much more predictable than natural language, the reasons behind this difference are not yet well understood, and finding the best way to adapt such models to this new context remains an open challenge.
This dissertation first looks at the wider context of mining GitHub to extract social and technical information and at how this information influences developers. Then, it focuses on validating and understanding what statistical language models tell us about code as a form of human communication. Where does the predictability of source code come from: the inherent syntax of code, or how developers choose to write it? Are metrics used to train language models, such as surprisal, which relates to cognitive load in natural language, still valid in a source code context? Answers to these questions reveal that some of code's predictability comes from human choices, and that surprisal relates to both human preferences and comprehension time in code. By improving our understanding of how language models work on code and how well their assumptions fit human cognitive processes, we are better positioned to create tools that use them effectively. Finally, comparing these models across natural language and code provides a new window into understanding human communication under varying constraints.
|Committee:||Filkov, Vladimir, Rubio González, Cindy|
|School:||University of California, Davis|
|School Location:||United States -- California|
|Source:||DAI-A 82/3(E), Dissertation Abstracts International|
|Subjects:||Computer science, Linguistics, Cognitive psychology|
|Keywords:||Mining, Open source software ecosystems, Understanding code, Human communication, Statistical language models|