Languages that distinguish diacritics in speech but omit some or all of them in writing produce written texts that are more ambiguous than usual. Omitting diacritics increases the number of possible meanings and pronunciations of a word, which poses a challenge for computational models. For example, the Yoruba word mu means “drink” when unmarked, but means “sink” when diacritized as mù and “sharp” when diacritized as mú. As an analogy in English, if we omit the vowel and write only pn, the word can be read as pan, pin, pun, or pen, each with a different meaning and pronunciation.
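The ambiguity described above can be illustrated with a minimal Python sketch. It assumes that diacritics are encoded as Unicode combining marks after NFD normalization; the function names `strip_diacritics` and `group_by_undiacritized` are hypothetical and are not taken from the dissertation.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Remove combining marks, simulating text written without diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def group_by_undiacritized(words):
    """Map each undiacritized form to the set of diacritized forms that collapse onto it."""
    groups = defaultdict(set)
    for word in words:
        groups[strip_diacritics(word)].add(unicodedata.normalize("NFC", word))
    return dict(groups)


# The three Yoruba forms from the abstract collapse onto one written form.
forms = ["mu", "mù", "mú"]
groups = group_by_undiacritized(forms)
for plain, variants in groups.items():
    print(plain, sorted(variants))
```

A restoration model has to invert this many-to-one mapping, choosing among the variants of each written form from context.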
In this dissertation, we discuss diacritic restoration models as a solution to this problem. Diacritic restoration automates the recovery of missing diacritics for each character in a written text, so that the resulting text is comparable to that of languages in which words are fully orthographically specified, such as English. We first discuss different solutions for fully specifying diacritics in written texts.
We investigate architectures that may offer better alternatives to the current state-of-the-art architectures, and we analyze their potential and limitations. We find that sequence-to-sequence classification provides a better solution for diacritic restoration in some cases, with two downsides: it can generate sentences whose length differs from that of the input, and it can generate words that are not diacritic variants of the input unit (hallucination). We also suggest a convolution-based architecture that is more efficient than recurrent-based models while yielding comparable accuracy. With both models, there is a trade-off between efficiency and accuracy.
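The two failure modes of sequence-to-sequence output mentioned above, length mismatch and hallucination, can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: tokens are whitespace-separated, diacritics are Unicode combining marks, and `check_restoration` is a hypothetical helper rather than part of the dissertation's models or evaluation.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks so that diacritic variants compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def check_restoration(source: str, predicted: str) -> dict:
    """Flag length mismatches and predicted tokens that are not diacritic variants of the input."""
    src_tokens = source.split()
    pred_tokens = predicted.split()
    # Compare tokens position by position, up to the shorter sequence.
    hallucinated = [
        (src, pred)
        for src, pred in zip(src_tokens, pred_tokens)
        if strip_diacritics(src) != strip_diacritics(pred)
    ]
    return {
        "length_mismatch": len(src_tokens) != len(pred_tokens),
        "hallucinated": hallucinated,
    }


# Usage: the second predicted token differs from the input in its base letters.
report = check_restoration("mu mu", "mù mi")
print(report)
```

A model that labels each input character with a diacritic cannot fail either check by construction, which is one way to read the trade-off the abstract describes.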
Having determined that Bidirectional Long Short-Term Memory (BiLSTM) is currently the best architecture for diacritic restoration in terms of accuracy, we further investigate how to enhance its accuracy. We examine the impact of different input and output representations in order to identify the optimal input unit for the task, and we find that characters are the optimal unit. We also propose a joint diacritic restoration model in which diacritics are learned along with other linguistic features that are helpful for assigning appropriate diacritics. This joint model provides a better solution for diacritic restoration.
We likewise investigate the impact of fully specifying diacritics in extrinsic evaluation. In theory, full diacritic restoration helps disambiguate homographs, but in practice it increases sparsity (i.e., insufficient training examples for each word) and the number of out-of-vocabulary words, which degrades the performance of downstream applications. Thus, after shedding light on different techniques that may boost the performance of full diacritic restoration, we attempt to find a sweet spot between zero and full diacritization (i.e., partial diacritization) as a replacement for full diacritization, in order to reduce lexical ambiguity without increasing sparsity. Partial diacritic restoration has been theorized but never systematically addressed before. We discuss different automated techniques for identifying partial diacritic schemes and examine whether partial diacritic restoration is beneficial for downstream applications. Although our findings are inconclusive, we build a foundation for future research on partial diacritic restoration and discuss current challenges at multiple levels.
|Advisor:||Diab, Mona T.|
|Committee:||Youssef, Abdou; Caliskan, Aylin; Habash, Nizar; Pless, Robert|
|School:||The George Washington University|
|School Location:||United States -- District of Columbia|
|Source:||DAI-B 81/7(E), Dissertation Abstracts International|
|Keywords:||Diacritic restoration, Downstream applications|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved