Morphological analysis is considered a crucial preliminary step in processing languages with complex morphology. In such languages, morphological analyzers are often built with comprehensive morphotactics, morpho-phonological rules and morpho-syntactic features to predict the properties of words (e.g., part-of-speech) based on affixes. Designing a morphological analyzer that produces a complete analysis, however, requires extensive human effort, and there is thus considerable interest in the unsupervised learning of morphological analysis to reduce the sparse data problem in an under-resourced language such as Malay. The challenge with Malay is that its complex internal morphological structures may lead to an unmanageably large lexicon. We suggest a templatic structure consisting of common affix sequences (i.e., affix-patterns) as an effective solution to save storage space through the concept of ‘find and fit’.
This dissertation research investigates the use of morphology-based language modeling and the Expectation Maximization (EM) algorithm for learning the derivational morphology of Malay. We first demonstrate how our model can be utilized to train naïve morphological segmentations. Naïve in this context indicates no knowledge of any constraints, order or co-occurrences of other morphemes. Next, we employ three different EM variants to learn the hidden derivational constraints of Malay. Through improved guesses, our EM algorithms iteratively optimize the maximum likelihood estimates of the partially observed parameters in our dataset, searching for the correct segmentation candidates at convergence. Finally, we evaluate the performance of our EM algorithms against our gold standard and the state-of-the-art segmentation tool, Morfessor 2.0. We find that our EM algorithms perform 10% better than Morfessor 2.0.
The EM algorithm is trained with and without morphology-based language models to observe the effect of language models on the performance of our unsupervised learning algorithm. Our experimental results reveal that the morphology-based language model helps to improve the performance of our EM algorithms and suggest that it is feasible to build a lexicon of ‘affix-patterns’ by exploiting naïve morphological segmentations. This body of work contributes to the construction of a lexicon in the form of derivational ‘affix-patterns’ for the derivational morphology of Malay as well as to a better understanding of Malay derivational morphological constraints.
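As a rough illustration of the general idea summarized above (naïve candidate segmentations re-weighted by EM), the following Python sketch may help. It is not the dissertation's model: the affix lists `PREFIXES` and `SUFFIXES`, the helper names `candidates` and `em`, and the simple per-slot unigram model are all assumptions introduced here for illustration only.

```python
from collections import defaultdict
from math import prod

# Hypothetical affix inventories, chosen only for this illustration.
PREFIXES = ["", "me", "mem", "ber", "di", "per"]
SUFFIXES = ["", "kan", "an", "i"]


def candidates(word):
    """Enumerate naive (prefix, stem, suffix) splits with a non-empty stem."""
    out = []
    for p in PREFIXES:
        for s in SUFFIXES:
            fits = word.startswith(p) and word.endswith(s)
            if fits and len(word) > len(p) + len(s):
                out.append((p, word[len(p):len(word) - len(s)], s))
    return out


def em(words, iterations=20):
    """Re-weight candidate segmentations with a per-slot unigram model (assumed)."""
    cands = {w: candidates(w) for w in words}
    # Start from a uniform posterior over each word's candidates.
    post = {w: [1.0 / len(c)] * len(c) for w, c in cands.items()}
    for _ in range(iterations):
        # M-step: expected counts of each morpheme in each slot.
        counts = [defaultdict(float) for _ in range(3)]
        for w, c in cands.items():
            for seg, q in zip(c, post[w]):
                for slot, morpheme in enumerate(seg):
                    counts[slot][morpheme] += q
        totals = [sum(c.values()) for c in counts]
        # E-step: posterior of each candidate under the current estimates.
        for w, c in cands.items():
            scores = [
                prod(counts[i][m] / totals[i] for i, m in enumerate(seg))
                for seg in c
            ]
            z = sum(scores)
            post[w] = [s / z for s in scores]
    # Return the highest-posterior segmentation for each word.
    return {w: max(zip(post[w], cands[w]))[1] for w in words}
```

Calling `em(["makan", "makanan", "memakan", "dimakan", "jalan", "berjalan"])` returns one `(prefix, stem, suffix)` triple per word. Which split wins depends entirely on the toy affix lists and the word list supplied, so the output should not be read as a linguistic claim about Malay.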
|Advisor:||Gasser, Michael E.|
|Committee:||Kübler, Sandra C., Leake, David B., Paolillo, John C.|
|School Location:||United States -- Indiana|
|Source:||DAI-B 79/04(E), Dissertation Abstracts International|
|Subjects:||Artificial intelligence, Computer science|
|Keywords:||Derivational morphological constraints, Unsupervised learning|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved