Dissertation/Thesis Abstract

Topic Modeling the Reading and Writing Behavior of Information Foragers
by Murdock, Jaimie, Ph.D., Indiana University, 2019, 143; 13900717
Abstract (Summary)

How do individuals create a knowledge base over a lifetime? Charles Darwin left detailed records of every book he read from The Voyage of the Beagle to just after publication of The Origin of Species. Additionally, he left copies of his drafts before publication. I use these records to build a case study of how reading and writing interact to create conceptual novelties, such as the theory of natural selection and modification by descent. The model is extended to cover entire disciplines by bootstrapping reading and writing histories from bibliographies in scientific publications, scaling the model to address the question of how we move from individual psychology to society.

There are two central components from cognitive science that impact the proposed models. The first is bounded cognition. People have limited attention, and that attention is further limited by an individual’s information processing ability. The second is information foraging, a framework for managing the trade-off between exploration of new information and exploitation of existing knowledge when searching for information. Most existing work on information foraging and bounded cognition examines short-term information foraging problems, such as formulating web search queries in a laboratory setting with a known information goal. Through the case study of Charles Darwin, I use real-world datasets to explore this problem at a timescale of decades with unknown information goals.

The base of the reading model is topic modeling with Latent Dirichlet Allocation (LDA). This method reduces the dimensionality of text by reducing each document to a topic distribution, where each topic is defined as a probability distribution over the words in the collection. With these probability distributions, I am able to apply information-theoretic measures to calculate the divergence between texts. These divergences characterize a particular reading decision as exploiting the topics exposed by previously read texts or exploring new topics. I do not train these topic models on the reading records themselves; instead, I identify each volume in the HathiTrust Digital Library and train the topic model on the full text of the books.
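The divergence calculation described above can be illustrated with a minimal sketch. The abstract does not name the specific measure, so Kullback-Leibler divergence is assumed here as one common information-theoretic choice. The function names, the use of the mean of earlier texts as the "previously read" baseline, and the toy topic distributions are all hypothetical and are not drawn from the dissertation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits for two discrete distributions of equal length.

    Assumption: KL divergence stands in for the unspecified
    information-theoretic measure. q is clamped at eps to avoid
    division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    return sum(
        pi * math.log2(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def average_distribution(distributions):
    """Element-wise mean of a list of topic distributions."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]

def reading_divergences(topic_history):
    """For each text after the first, compute the divergence of its topic
    distribution from the mean distribution of all earlier texts.

    Assumption: the reader's prior exposure is summarized by a simple
    mean of earlier texts; other baselines are possible.
    """
    results = []
    for i in range(1, len(topic_history)):
        past = average_distribution(topic_history[:i])
        results.append(kl_divergence(topic_history[i], past))
    return results

# Toy reading history over three topics (hypothetical values).
history = [
    [0.8, 0.1, 0.1],  # first text
    [0.7, 0.2, 0.1],  # similar topics: a low divergence is expected
    [0.1, 0.1, 0.8],  # a shift to a new topic: a high divergence is expected
]
divergences = reading_divergences(history)
```

Under this sketch, a low divergence would correspond to exploiting topics already encountered, and a high divergence to exploring new ones. In practice the topic distributions would come from a trained LDA model rather than hand-written lists.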

While Darwin’s reading notebooks and manuscript drafts provide relatively precise information on reading and writing behaviors at a day-level granularity, that type of data is rare. I explore three extensions of the models, dealing with progressively more “fuzzy” data. First, I look at the contents of Darwin’s Library at the time of his death to infer his readings from 1860 to 1882. These readings are used to provide a preliminary analysis of his work on The Descent of Man and the later editions of The Origin of Species. Then, I look at another historical figure: Thomas Jefferson, whose working library formed the basis of the Library of Congress. I examine the bibliography of his retirement library and tie it into his correspondence to find possible evidence for when certain volumes were read. Finally, I scale the model up to the discipline of neuroscience. I extract citation graphs from the Web of Science to infer reading histories for neuroscientists based on the articles they cited. I use the text of the abstracts of these articles to perform an analysis of readings and writings similar to the Darwin case study. These extensions of the model highlight the potential to work with less precise data and illuminate future problems.

Throughout the work, I emphasize the notions of multiple realizability and interpretive pluralism. Each model is itself a population of models, and while simpler term-frequency-based models may show many of the same effects as the topic models, I argue for the explanatory power of the topic model with respect to causality.

Indexing (document details)
Advisor: Allen, Colin; Milojevic, Stasa
Committee: Jones, Michael; Todd, Peter
School: Indiana University
Department: Informatics
School Location: United States -- Indiana
Source: DAI-A 81/2(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Information science, Cognitive psychology, Science history
Keywords: Digital humanities, Topic modeling, Latent Dirichlet Allocation
Publication Number: 13900717
ISBN: 9781085617130
Copyright © 2019 ProQuest LLC. All rights reserved.