Dissertation/Thesis Abstract

Unsupervised Learning of Derivational Morphological Constraints
by Sulaiman, Suriani, Ph.D., Indiana University, 2017, 203; 10687281
Abstract (Summary)

Morphological analysis is considered a crucial preliminary step in processing languages with complex morphology. In such languages, morphological analyzers are often built with comprehensive morphotactics, morpho-phonological rules, and morpho-syntactic features to predict the properties of words (e.g., part of speech) based on affixes. Designing a morphological analyzer that produces a complete analysis, however, requires extensive human effort, and there is thus considerable interest in the unsupervised learning of morphological analysis to reduce the sparse data problem in an under-resourced language such as Malay. The challenge with Malay is that its complex internal morphological structures may lead to an unmanageably large lexicon. We suggest a templatic structure consisting of common affix sequences (i.e., affix-patterns) as an effective solution to save storage space through the concept of ‘find and fit’.

This dissertation investigates the use of morphology-based language modeling and the Expectation Maximization (EM) algorithm for learning the derivational morphology of Malay. We first demonstrate how our model can be used to train naïve morphological segmentations. Naïve in this context indicates no knowledge of any constraints, order, or co-occurrences of other morphemes. Next, we employ three different EM variants to learn the hidden derivational constraints of Malay. Through improved guesses, our EM algorithms iteratively optimize the maximum likelihood estimates of the partially observed parameters in our dataset in search of the correct segmentation candidates at convergence. Finally, we evaluate the performance of our EM algorithms against our gold standard and the state-of-the-art segmentation tool Morfessor 2.0. We find that our EM algorithms perform 10% better than Morfessor 2.0.
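The EM idea described above can be illustrated with a minimal sketch. It is not the dissertation's model: the abstract does not specify the three EM variants or the language model, so everything here is an assumption made for illustration. The sketch assumes a simple prefix + stem + suffix template, small hypothetical affix inventories (`PREFIXES`, `SUFFIXES`), one multinomial per template slot, and a toy word list. The names `candidates`, `em_segment`, and `best` are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical affix inventories for illustration only; "" means "no affix".
PREFIXES = ("", "di", "me", "ber")
SUFFIXES = ("", "an", "kan", "i")


def candidates(word, min_stem=3):
    """Enumerate naive (prefix, stem, suffix) splits, ignoring all constraints."""
    out = []
    for p in PREFIXES:
        for s in SUFFIXES:
            stem = word[len(p):len(word) - len(s)]
            if word.startswith(p) and word.endswith(s) and len(stem) >= min_stem:
                out.append((p, stem, s))
    return out or [("", word, "")]  # words too short to split stay whole


def em_segment(words, iterations=25):
    """Fit one multinomial per template slot with EM; return candidates and posteriors."""
    cands = {w: candidates(w) for w in set(words)}
    # Constant initial weights give every candidate the same initial posterior.
    probs = [defaultdict(lambda: 1.0) for _ in range(3)]
    post = {}
    for _ in range(iterations):
        counts = [defaultdict(float) for _ in range(3)]
        for w in words:
            # E-step: posterior over this word's candidate segmentations.
            scores = [probs[0][p] * probs[1][t] * probs[2][s] for p, t, s in cands[w]]
            z = sum(scores)
            post[w] = [x / z for x in scores]
            for cand, r in zip(cands[w], post[w]):
                for slot, morph in enumerate(cand):
                    counts[slot][morph] += r  # expected (fractional) count
        # M-step: renormalise the expected counts within each slot.
        probs = []
        for cnt in counts:
            total = sum(cnt.values())
            probs.append({m: c / total for m, c in cnt.items()})
    return cands, post


def best(word, cands, post):
    """Return the highest-posterior split with empty slots dropped."""
    i = max(range(len(post[word])), key=post[word].__getitem__)
    return [m for m in cands[word][i] if m]


if __name__ == "__main__":
    # Toy Malay word forms chosen for this sketch, not the dissertation's data.
    toy = ["makan", "makanan", "dimakan", "minum", "minuman", "diminum"]
    cands, post = em_segment(toy)
    for w in toy:
        print(w, best(w, cands, post))
```

Scoring every candidate with exactly three slot factors is a design choice of this sketch: it avoids the bias of a plain unigram model toward leaving words unsegmented. The segmentations it prints depend on the assumed inventories and initialization and are not meant to reproduce any reported result.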

The EM algorithms are trained with and without morphology-based language models to observe the effect of language models on the performance of our unsupervised learning algorithms. Our experimental results reveal that the morphology-based language model helps to improve the performance of our EM algorithms and suggest that it is feasible to build a lexicon of ‘affix-patterns’ by exploiting naïve morphological segmentations. This body of work contributes to the construction of a lexicon of derivational ‘affix-patterns’ for Malay as well as to a better understanding of Malay derivational morphological constraints.

Indexing (document details)
Advisor: Gasser, Michael E.
Committee: Kübler, Sandra C., Leake, David B., Paolillo, John C.
School: Indiana University
Department: Computer Sciences
School Location: United States -- Indiana
Source: DAI-B 79/04(E), Dissertation Abstracts International
Subjects: Artificial intelligence, Computer science
Keywords: Derivational morphological constraints, Unsupervised learning
Publication Number: 10687281
ISBN: 9780355567588