How do individuals create a knowledge base over a lifetime? Charles Darwin left detailed records of every book he read from The Voyage of the Beagle to just after publication of The Origin of Species. He also left copies of his drafts before publication. I use these records to build a case study of how reading and writing interact to create conceptual novelties, such as the theory of natural selection and descent with modification. The model is then extended to cover entire disciplines by bootstrapping reading and writing histories from bibliographies in scientific publications, scaling it to address the question of how we move from individual psychology to society.
Two central components from cognitive science inform the proposed models. The first is bounded cognition: people have limited attention, and that attention is further limited by an individual’s information processing ability. The second is information foraging, a framework for managing the trade-off between exploration of new information and exploitation of existing knowledge when searching for information. Most existing work on information foraging and bounded cognition examines short-term information foraging problems, such as formulating web search queries in a laboratory setting with a known information goal. Through the case study of Charles Darwin, we use real-world datasets to explore this problem at a timescale of decades with unknown information goals.
The base of the reading model is topic modeling with Latent Dirichlet Allocation (LDA). This method reduces the dimensionality of text by representing each document as a topic distribution, where each topic is defined as a probability distribution over the words in the collection. With these probability distributions, we are able to apply information-theoretic measures to calculate the divergence between texts. These divergences characterize a particular reading decision as either exploiting the topics exposed by previously read texts or exploring new topics. I do not train these topic models on the reading records themselves; instead, I identify each volume in the HathiTrust Digital Library and train the topic model on the full text of the books.
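The divergence step can be illustrated with a minimal sketch. It assumes Kullback-Leibler divergence as the information-theoretic measure (the abstract does not name the specific measure), measured from each newly read text to both the immediately preceding text and the average of all earlier texts. The function names and the three-topic toy distributions are hypothetical and are not taken from the dissertation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions over the same topics.

    Returns infinity if q assigns zero probability to a topic that p uses.
    (LDA topic distributions are normally smoothed, so this is rare.)
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total


def mean_distribution(dists):
    """Element-wise average of several topic distributions."""
    n = len(dists)
    return [sum(column) / n for column in zip(*dists)]


def reading_divergences(history):
    """For each text after the first, return a pair of divergences:
    (vs. the previous text, vs. the average of all earlier texts).

    Assumption: the direction KL(current || earlier) is an illustrative choice.
    """
    results = []
    for i in range(1, len(history)):
        current = history[i]
        previous = history[i - 1]
        past = mean_distribution(history[:i])
        results.append((kl_divergence(current, previous),
                        kl_divergence(current, past)))
    return results


# Hypothetical topic distributions for three books, in reading order.
history = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],  # close to the first book
    [0.1, 0.1, 0.8],  # shifts to a mostly unseen topic
]
divergences = reading_divergences(history)
```

Under this sketch, a low divergence would be read as exploitation of already familiar topics and a high divergence as exploration; where to draw the line between the two is a modeling decision that the sketch leaves open.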
While Darwin’s reading notebooks and manuscript drafts provide relatively precise information on reading and writing behaviors at a day-level granularity, that type of data is rare. I explore three extensions of the models, dealing with progressively more “fuzzy” data. First, I look at the contents of Darwin’s library at the time of his death to infer his readings from 1860 to 1882. These readings are used to provide a preliminary analysis of his work on The Descent of Man and the later editions of The Origin of Species. Then, I look at another historical figure: Thomas Jefferson, whose working library formed the basis of the Library of Congress. I examine the bibliography of his retirement library and tie it to his correspondence to find possible evidence for when certain volumes were read. Finally, I scale the model up to the discipline of neuroscience. I extract citation graphs from the Web of Science to infer reading histories for neuroscientists based on the articles they cited. I use the text of the abstracts of these articles to perform an analysis of readings and writings similar to the Darwin case study. These extensions of the model highlight the potential to work with less precise data and illuminate future problems.
Throughout the work, I emphasize the notion of multiple realizability and interpretive pluralism. Each model is itself a population of models, and while simpler term-frequency-based models may show many of the same effects as the topic models, an argument is made for the explanatory power of the topic model with respect to causality.
|Advisor:||Allen, Colin; Milojevic, Stasa|
|Committee:||Jones, Michael; Todd, Peter|
|School Location:||United States -- Indiana|
|Source:||DAI-A 81/2(E), Dissertation Abstracts International|
|Subjects:||Information science, Cognitive psychology, Science history|
|Keywords:||Digital humanities, Topic modeling, Latent Dirichlet Allocation|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved