This work explores the association between human personality and language features consisting of sequences of tokens. It shows that some such features are predictive of personality across multiple corpora drawn from different populations of English speakers. I gathered written text authored by 50 individuals who participated in a bodybuilding web forum (the Forum corpus). I also administered a personality questionnaire following the protocol provided by the International Personality Item Pool (IPIP). For comparison across other populations, I obtained text corpora from three other research groups, along with the results of the corresponding personality assessments: the EAR corpora, consisting of transcripts of the speech of 96 participants as they went about their daily lives; Essays written by 2,588 undergraduates at the University of Texas; and posts by 244 Facebook users. After performing part-of-speech (POS) tagging on the text of all participants in these corpora, I extracted unigrams, bigrams, and trigrams (n-grams) of tokens and their POS tags, and counted every word/tag permutation that appeared.
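The extraction step described above can be sketched as follows. This is a minimal illustration, not the dissertation's actual code: it assumes that "word/tag permutation" means each position of an n-gram is represented by either the word or its POS tag, it assumes the text has already been tagged, and the function name `mixed_ngrams`, the bracketed tag notation, and the sample sentence are all hypothetical.

```python
from collections import Counter
from itertools import product


def mixed_ngrams(tagged, max_n=3):
    """Count n-grams (n = 1..max_n) where each slot holds a word or its POS tag.

    `tagged` is a list of (word, tag) pairs, assumed to come from any POS tagger.
    Words are lowercased; tags are wrapped in brackets so that a tag can never
    collide with a word (an assumed convention for this sketch).
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tagged) - n + 1):
            window = tagged[start:start + n]
            # For every slot, independently choose the word (0) or the tag (1).
            for choice in product((0, 1), repeat=n):
                feature = tuple(
                    word.lower() if pick == 0 else f"[{tag}]"
                    for (word, tag), pick in zip(window, choice)
                )
                counts[feature] += 1
    return counts


# Toy input, invented for illustration only.
tagged = [("I", "PRP"), ("lift", "VBP"), ("weights", "NNS")]
counts = mixed_ngrams(tagged)
```

Under this reading, a bigram such as "I lift" contributes four features: word-word, word-tag, tag-word, and tag-tag.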
I considered only features appearing at least once per 1,000 words in the Forum corpus, because there was not enough data to consider sparser features. I found 766 such features. Among them, I examined which were relevant in both my Forum corpus and at least one of the borrowed corpora, since those are the most promising and robust features and illustrate the possibility of building models across various corpora using the same language features. Of the 766 features, 75 were associated with one or more personality dimensions in both the Forum corpus and at least one additional corpus. I devised explanations as to why some of the features are correlated with a given personality dimension. That exercise establishes that, although some of the features may have arisen by chance, one can confidently conclude that English speakers consistently express their personalities through their language usage. In addition, to show that these features can be used for prediction, I fitted multiple linear regression models for each corpus-personality dimension combination; in the best case (Openness with the Forum corpus) I obtained an R² of 0.686 and an S (standard error of the estimate) of 0.561. My work lays a foundation for more robust, accurate models of personality. I hope that others will find additional principled explanations of why the features I found are associated with personality. In the future, I anticipate that suitable language-analytical techniques will deepen insight into both English speakers and speakers of other world languages.
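The frequency threshold and the two fit statistics mentioned above can be illustrated with a short sketch. It makes several assumptions: the rate is computed over the whole corpus rather than per participant, S is the usual standard error of the estimate, sqrt(SSE / (n - p - 1)) for a model with p predictors and an intercept, and the helper names and all numbers below are invented toy values, not results from the dissertation.

```python
import math


def frequent_features(counts, total_words, min_per_1000=1.0):
    """Keep features whose corpus-wide rate is at least min_per_1000 per 1,000 words."""
    return {
        feature: count
        for feature, count in counts.items()
        if 1000.0 * count / total_words >= min_per_1000
    }


def fit_summary(observed, predicted, n_predictors):
    """Return (R^2, S) for predictions from a linear model with an intercept."""
    n = len(observed)
    mean_y = sum(observed) / n
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))  # residual sum of squares
    sst = sum((y - mean_y) ** 2 for y in observed)                # total sum of squares
    r_squared = 1.0 - sse / sst
    # Standard error of the estimate: residual spread adjusted for model size.
    s = math.sqrt(sse / (n - n_predictors - 1))
    return r_squared, s


# Toy counts for a hypothetical 2,000-word corpus.
toy_counts = {"i": 30, "[PRP] [VBP]": 12, "weights": 1}
kept = frequent_features(toy_counts, total_words=2000)

# Toy personality scores and model predictions.
r_squared, s = fit_summary(
    observed=[1.0, 2.0, 3.0, 4.0, 5.0],
    predicted=[1.1, 1.9, 3.2, 3.8, 5.0],
    n_predictors=1,
)
```

In the toy corpus, "weights" occurs 0.5 times per 1,000 words and is dropped, while the other two features are kept.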
|Committee:||Hayashi, Kentaro, Kuh, Anthony, Ogawa, Michael-Brian, O'Grady, William, Robertson, Scott|
|School:||University of Hawai'i at Manoa|
|School Location:||United States -- Hawaii|
|Source:||DAI-A 82/1(E), Dissertation Abstracts International|
|Subjects:||Computer science, Psychology, Linguistics|
|Keywords:||Sequences of tokens, Part-of-speech tagging, Human personality|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved