Dissertation/Thesis Abstract

Full and Partial Diacritic Restoration: Development and Impact on Downstream Applications
by Alqahtani, Sawsan, Ph.D., The George Washington University, 2020, 153; 27668278
Abstract (Summary)

Languages that realize diacritics in speech but omit them to some degree in writing produce written texts that are more ambiguous than usual. Omitting diacritics increases the number of possible meanings and pronunciations of a word, which poses a challenge for computational models. For example, the Yoruba word mu means “drink” when unmarked, but means “sink” or “sharp” depending on the diacritics it carries. Similarly, if we omit the vowels of an English word and write pn, it can be read as pan, pin, pun, or pen, each with a different meaning and pronunciation.

In this dissertation, we discuss diacritic restoration models as a solution to this problem. Diacritic restoration automatically restores the missing diacritics for each character in a written text, so that the resulting text is comparable to that of languages in which words are fully orthographically specified, such as English. We first discuss different solutions to fully specify diacritics in written texts.

We investigate architectures that may offer better alternatives to the current state-of-the-art architectures, and we analyze their potential and limitations. We find that sequence-to-sequence classification provides a better solution for diacritic restoration in some cases, with the downside that it can generate sentences whose length differs from that of the input, as well as words that are not diacritic variants of the corresponding input unit (hallucination). We also propose a convolution-based architecture that is more efficient than recurrent-based models while yielding comparable accuracy. With both models, there is a trade-off between efficiency and accuracy.
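
The two failure modes named above (length mismatch and hallucination) can be checked mechanically. The following is a minimal sketch, not the dissertation's implementation: the function names are hypothetical, and treating every Unicode combining mark (category Mn) as a diacritic is a simplifying assumption that may not hold for every script.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (Unicode category Mn) after NFD decomposition.

    Assumption: every Mn character counts as a diacritic.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_diacritic_variant(source: str, output: str) -> bool:
    """True if output and source are identical once diacritics are removed."""
    return strip_diacritics(output) == strip_diacritics(source)


def check_restoration(source_words, output_words):
    """Return (same_length, indices of output words that are not variants)."""
    same_length = len(source_words) == len(output_words)
    hallucinated = [
        i
        for i, (src, out) in enumerate(zip(source_words, output_words))
        if not is_diacritic_variant(src, out)
    ]
    return same_length, hallucinated


# "m\u00f9" is "mu" with a grave accent; "pan" adds a letter to "pn",
# so it is flagged as not being a diacritic variant of its input.
result = check_restoration(["mu", "pn"], ["m\u00f9", "pan"])
```

A check of this kind only detects the two problems; it does not repair them.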

Having determined that the Bidirectional Long Short-Term Memory (BiLSTM) network is currently the best architecture for diacritic restoration in terms of accuracy, we investigate methods to enhance its accuracy further. We examine the impact of different input and output representations to identify the optimal input unit for the task, and find that characters provide the optimal solution. We also propose a joint diacritic restoration model in which diacritics are learned along with other linguistic features that help assign appropriate diacritics; this joint model provides a better solution for diacritic restoration.
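
To illustrate what a character-level representation can look like, the sketch below splits a diacritized string into (base character, diacritic label) pairs, so that restoration becomes predicting one label per input character. This is an illustrative assumption about the framing, not the exact representation used in the dissertation; the function names are hypothetical, and Unicode combining marks (category Mn) are assumed to be the diacritics.

```python
import unicodedata


def to_char_labels(diacritized: str):
    """Split text into (base_character, diacritic_label) pairs.

    The label is the (possibly empty) string of combining marks that
    follow the base character after NFD decomposition.
    """
    pairs = []
    for ch in unicodedata.normalize("NFD", diacritized):
        if unicodedata.category(ch) == "Mn" and pairs:
            base, label = pairs[-1]
            pairs[-1] = (base, label + ch)  # attach mark to preceding base
        else:
            pairs.append((ch, ""))
    return pairs


def from_char_labels(pairs):
    """Rebuild the diacritized text from (base, label) pairs."""
    return unicodedata.normalize("NFC", "".join(b + l for b, l in pairs))


# "m\u00fa" is "mu" with an acute accent on the second character.
pairs = to_char_labels("m\u00fa")
inputs = [base for base, _ in pairs]     # what a model would read
targets = [label for _, label in pairs]  # what a model would predict
```

Because each input character receives exactly one label, a model trained on this framing produces output of the same length as its input by construction.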

We likewise investigate the impact of fully specifying diacritics in extrinsic evaluation. In theory, full diacritic restoration helps disambiguate homographs, but in practice it increases sparsity (i.e., insufficient training examples for each word) and the number of out-of-vocabulary words, which degrades the performance of downstream applications. Thus, after shedding light on different techniques that may boost the performance of full diacritic restoration, we attempt to find a sweet spot between zero and full diacritization (i.e., partial diacritization) as a replacement for full diacritization, in order to reduce lexical ambiguity without increasing sparsity. Partial diacritic restoration has been theorized but never systematically addressed before. We discuss different automated techniques for identifying partial diacritic schemes and examine whether partial diacritic restoration benefits downstream applications. Although our findings are inconclusive, we build a foundation for future research on partial diacritic restoration and discuss current challenges at multiple levels.
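
The sparsity argument can be made concrete by counting distinct word types under zero, partial, and full diacritization. The sketch below uses toy strings rather than real data, and its partial scheme (keep diacritics only for words whose bare form is on a given list) is a hypothetical stand-in, not one of the schemes studied in the dissertation.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks (assumed here to be the diacritics)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def partially_diacritize(tokens, keep_for):
    """Keep diacritics only on tokens whose bare form is in keep_for."""
    return [
        tok if strip_diacritics(tok) in keep_for else strip_diacritics(tok)
        for tok in tokens
    ]


def vocabulary_size(tokens) -> int:
    """Number of distinct word types."""
    return len(set(tokens))


# Toy fully diacritized tokens (illustrative strings, not a corpus).
full = ["m\u00f9", "m\u00fa", "b\u00e0", "ba"]
zero = [strip_diacritics(tok) for tok in full]
partial = partially_diacritize(full, keep_for={"mu"})

sizes = {
    "zero": vocabulary_size(zero),
    "partial": vocabulary_size(partial),
    "full": vocabulary_size(full),
}
```

On these toy tokens the vocabulary grows from zero to partial to full diacritization, which is the trade-off between ambiguity and sparsity described above.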

Indexing (document details)
Advisor: Diab, Mona T.
Committee: Youssef, Abdou; Caliskan, Aylin; Habash, Nizar; Pless, Robert
School: The George Washington University
Department: Computer Science
School Location: United States -- District of Columbia
Source: DAI-B 81/7(E), Dissertation Abstracts International
Source Type: DISSERTATION
Subjects: Computer science
Keywords: Diacritic restoration, Downstream applications
Publication Number: 27668278
ISBN: 9781392803592
Copyright © 2021 ProQuest LLC. All rights reserved.