We compare probabilistic clustering methods for analyzing unstructured text or images relevant to prioritizing system improvement actions. Such system improvement activities require an awareness of the entire corpus, that is, the full set of documents such as transcripts of phone conversations or images. For example, a manager trying to improve the performance of a call center might want to understand quantitatively what fraction of calls falls into each of a set of types (cluster or topic proportions) and what those types are, including their associated phrases (cluster or topic definitions). If a sizable fraction of conversations, e.g., 15%, used unapproved language, there could be a high priority on implementing standardization or training to reduce cost and improve customer satisfaction related to the identified cluster or topic. We argue that such prioritization can be well understood only if the proportions and definitions of all of the clusters or topics are accounted for accurately.
The goal of accurately accounting for the entire corpus differs from the goals of information retrieval, which concern identifying specific documents of interest in response to specific queries. As a result, our comparison is based on “ground truth” models of four entire corpora and four measures of distribution fitting accuracy. However, the literature lacks numerical and case study comparisons of probabilistic clustering methods for cases with ground truth standards.
Benefits of comparisons based on ground truth models and given corpora also include the provision of complete examples, so that readers can see clearly how different approaches can be applied. Further, using the accuracy of cluster identification permits the comparison of popular methods such as fuzzy clustering together with generative methods such as Bayesian mixture models. This holds as long as we interpret the fuzzy clustering model as a topic model, which we do. The resulting “fuzzy topic models” offer demonstrated advantages over latent Dirichlet allocation in repeatability and computational efficiency.
The generative methods include so-called “topic” models; they are generative because they provide a distribution from which entire corpora could be sampled. We provide a numerical study that clarifies the relative accuracy of the probabilistic clustering methods, including fuzzy clustering, Principal Component Analysis (PCA) followed by fuzzy clustering, latent Dirichlet allocation (LDA), and the recently proposed Subject Matter Expert Refined Topic (SMERT) models.
We illustrate the application of the methods to the analysis of a call center in the insurance industry. We also illustrate how prioritization-related information can be derived from a corpus of documents, and we document how the relevant probabilistic clustering methods can be applied.
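To make the notions of cluster proportions and cluster definitions concrete, the following is a minimal sketch of one possible way to read a fuzzy clustering as a topic model. It is not the dissertation's implementation. The vocabulary, the word counts, the helper name `fuzzy_c_means`, and the fuzzifier value `m = 2` are all assumptions chosen for illustration. Each document is represented by its word frequencies, standard fuzzy c-means is run on those rows, the cluster centers are read as word distributions (definitions), and the average membership per cluster is read as the cluster proportion.

```python
import random

def fuzzy_c_means(rows, n_clusters, m=2.0, n_iter=100, seed=0):
    """Standard fuzzy c-means on row vectors; returns (memberships, centers).

    Hypothetical helper for illustration only (pure Python, no libraries).
    """
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    # Random initial memberships; each document's memberships sum to 1.
    u = []
    for _ in range(n):
        w = [rng.random() + 1e-9 for _ in range(n_clusters)]
        s = sum(w)
        u.append([x / s for x in w])
    centers = []
    for _ in range(n_iter):
        # Center update: mean of the rows weighted by membership**m.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(n)]
            total = sum(weights)
            centers.append(
                [sum(weights[i] * rows[i][j] for i in range(n)) / total
                 for j in range(d)]
            )
        # Membership update: proportional to distance**(-2 / (m - 1)).
        for i in range(n):
            dist = [
                max(sum((rows[i][j] - c[j]) ** 2 for j in range(d)) ** 0.5, 1e-12)
                for c in centers
            ]
            inv = [dk ** (-2.0 / (m - 1.0)) for dk in dist]
            s = sum(inv)
            u[i] = [x / s for x in inv]
    return u, centers

# Invented toy corpus: word counts per call transcript (assumption, not real data).
vocabulary = ["claim", "policy", "refund", "cancel"]
counts = [
    [3, 2, 0, 0],
    [4, 1, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 4],
    [0, 1, 3, 3],
]
# Convert counts to word frequencies so every document row sums to 1.
freqs = [[c / sum(row) for c in row] for row in counts]

memberships, centers = fuzzy_c_means(freqs, n_clusters=2)

# Cluster proportions: average membership of the documents in each cluster.
proportions = [
    sum(row[k] for row in memberships) / len(memberships) for k in range(2)
]

for k, center in enumerate(centers):
    # Centers are weighted means of rows that sum to 1, so they also sum to 1
    # and can be read as word distributions over the vocabulary.
    ranked = sorted(zip(vocabulary, center), key=lambda pair: -pair[1])
    print(f"cluster {k}: proportion={proportions[k]:.2f}, top words={ranked[:2]}")
```

Because every document row sums to one, each center is a weighted average of such rows and is itself a valid distribution over words, which is what allows it to be read as a topic definition under this interpretation.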
|Committee:||Mount-Campbell, Clark, Patton, Bruce, Xia, Cathy|
|School:||The Ohio State University|
|Department:||Industrial and Systems Engineering|
|School Location:||United States -- Ohio|
|Source:||DAI-B 78/11(E), Dissertation Abstracts International|
|Keywords:||Comparison, Probabilistic clustering, SMERT, Subject matter expert refined topic|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved