Dissertation/Thesis Abstract

Comparison and Application of Probabilistic Clustering Methods for System Improvement Prioritization
by Lee, Soo Ho, Ph.D., The Ohio State University, 2012, 76; 10631201
Abstract (Summary)

We compare probabilistic clustering methods for analyzing unstructured text or images relevant to prioritizing system improvement actions. Such system improvement activities require an awareness of the entire corpus or set of documents, such as transcripts of phone conversations or images. For example, a manager trying to improve the performance of a call center might want to understand quantitatively what fraction of calls falls into each of a set of types (cluster or topic proportions) and what those types are, including their associated phrases (cluster or topic definitions). If a sizable fraction of conversations, e.g., 15%, were using unapproved language, there could be a high priority on implementing standardization or training to reduce cost and improve customer satisfaction related to the identified cluster or topic. We argue that such priorities can be properly understood only if the proportions and definitions of all of the clusters or topics are accounted for accurately.

The goal of accurately accounting for the entire corpus differs from the goals of information retrieval, which concerns identifying specific documents of interest in response to specific queries. As a result, our comparison is based on “ground truth” models of four entire corpora and four measures of distribution fitting accuracy. Yet the literature lacks numerical and case study comparisons of probabilistic clustering methods for cases with ground truth standards.

Benefits of comparisons based on ground truth models and given corpora also include the provision of complete examples, so that readers can see clearly how different approaches can be applied. Further, using the accuracy of cluster identification permits the comparison of popular methods such as fuzzy clustering together with generative methods such as Bayesian mixture models. This holds as long as we interpret the fuzzy clustering model as a topic model, which we do. The resulting “fuzzy topic models” offer demonstrated advantages over latent Dirichlet allocation in repeatability and computational efficiency.

The generative methods include so-called “topic” models; they are generative because they provide a distribution from which entire corpora could be sampled. We provide a numerical study that clarifies the relative accuracy of the probabilistic clustering methods, including fuzzy clustering, Principal Component Analysis (PCA) followed by fuzzy clustering, latent Dirichlet allocation (LDA), and the recently proposed Subject Matter Expert Refined Topic (SMERT) models.
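As a rough illustration of what it can mean to read a fuzzy clustering as a topic model, the following minimal sketch runs a plain fuzzy c-means on word-frequency rows. It is not the dissertation's implementation: the toy vocabulary, the counts, the function name `fuzzy_c_means`, the fuzzifier `m = 2`, and the use of the mean membership as the corpus-level topic proportion are all assumptions made here for illustration. Because each document row sums to 1, each cluster center is a weighted average of such rows and so also sums to 1, which lets it be read as a topic definition (a distribution over words), while the membership rows play the role of per-document topic proportions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=200, tol=1e-8, seed=0):
    """Plain fuzzy c-means (hypothetical helper, not the author's code).

    X is an (n_docs, n_words) array. Returns memberships U with shape
    (n_docs, n_clusters) and cluster centers with shape (n_clusters, n_words).
    """
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row is normalized to sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)

    def compute_centers(U):
        # Each center is a membership-weighted average of the documents.
        W = U ** m
        return (W.T @ X) / W.sum(axis=0)[:, None]

    for _ in range(n_iter):
        centers = compute_centers(U)
        # Euclidean distance from every document to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)  # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, compute_centers(U)

# Toy corpus (assumed data): rows are call transcripts, columns are word counts.
vocab = ["claim", "policy", "refund", "cancel"]
counts = np.array([
    [5, 4, 0, 0],
    [6, 3, 1, 0],
    [4, 5, 0, 1],
    [0, 1, 5, 4],
    [1, 0, 4, 6],
    [0, 0, 6, 3],
], dtype=float)

# Convert counts to per-document word frequencies (each row sums to 1).
X = counts / counts.sum(axis=1, keepdims=True)

U, centers = fuzzy_c_means(X, n_clusters=2)

# Assumed corpus-level summary: average membership per cluster.
proportions = U.mean(axis=0)

for k, center in enumerate(centers):
    top_words = [vocab[i] for i in np.argsort(center)[::-1][:2]]
    print(f"topic {k}: proportion={proportions[k]:.2f}, top words={top_words}")
```

In this reading, `proportions` corresponds to the cluster or topic proportions a manager would use for prioritization, and the highest-weight words in each row of `centers` correspond to the cluster or topic definitions. The cluster numbering is arbitrary and depends on the random initialization.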

We illustrate the application of the methods to the analysis of a call center in the insurance industry, and we show how prioritization-related information can be derived from its corpus of documents. We also document how the relevant probabilistic clustering methods can be applied.

Indexing (document details)
Advisor: Allen, Theodore
Committee: Mount-Campbell, Clark; Patton, Bruce; Xia, Cathy
School: The Ohio State University
Department: Industrial and Systems Engineering
School Location: United States -- Ohio
Source: DAI-B 78/11(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Industrial engineering
Keywords: Comparison, Probabilistic clustering, SMERT, Subject matter expert refined topic
Publication Number: 10631201
ISBN: 978-0-355-01481-5
Copyright © 2019 ProQuest LLC. All rights reserved.