The field of Natural Language Processing (NLP) has progressed rapidly in recent years due to the evolution of deep neural models and their core building blocks, word embeddings. While word embeddings are often used as a primary component, the ultimate goal of any NLP system is to capture the underlying linguistic characteristics of word sequences, including phrases, sentences, or paragraphs, that are suited to an end task. To generate these kinds of embeddings, most NLP models rely on a mathematical operation, such as averaging or other pooling mechanisms, over smaller units such as words, morphemes, or even characters. However, these representations tend to be task-specific and typically perform poorly when transferred to other tasks. Consequently, different models have been proposed to generate general-purpose sentence embeddings for use in a pretraining protocol. The most notable differences between these models are their efficiency and performance trade-offs. To date, however, most proposed embedding models remain largely indifferent to the underlying syntactic and semantic characteristics of the text.
To this end, in this dissertation, we develop an efficient sentence embedding model that is capable of capturing both syntactic and semantic properties. First, we use the Discrete Cosine Transform (DCT) to compress word sequences in an order-preserving manner. The lower-order DCT coefficients capture the overall feature patterns in sentences, which yields embeddings suited to tasks that benefit from syntactic features. Our results on probing tasks demonstrate that DCT embeddings indeed preserve more syntactic information than the most commonly used approach, vector averaging. With practically equivalent complexity, the DCT model yields better overall performance on downstream classification tasks that correlate with syntactic features. This illustrates the capacity of DCT to preserve word-order information.
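The mechanism described above can be sketched as follows. This is a minimal, hypothetical illustration, not the dissertation's code: the function name `dct_embed`, the use of SciPy's `dct`, and the choice of `k` are all assumptions; the idea shown is applying a type-II DCT along the word-sequence axis and keeping the `k` lowest-order coefficient rows as a fixed-length sentence embedding.

```python
import numpy as np
from scipy.fft import dct  # type-II DCT with orthonormal scaling

def dct_embed(word_vectors: np.ndarray, k: int = 4) -> np.ndarray:
    """Compress a (seq_len, dim) matrix of word vectors into a fixed-length
    sentence embedding by keeping the first k DCT coefficients per dimension.
    Hypothetical sketch; k and the padding scheme are illustrative choices."""
    # DCT along the sequence axis: low-order rows summarize the slowly
    # varying, order-sensitive patterns across the word sequence.
    coeffs = dct(word_vectors, type=2, axis=0, norm="ortho")[:k]
    if coeffs.shape[0] < k:
        # Zero-pad short sentences so every output has length k * dim.
        pad = np.zeros((k - coeffs.shape[0], word_vectors.shape[1]))
        coeffs = np.vstack([coeffs, pad])
    # Concatenate the k coefficient rows into one fixed-length vector.
    return coeffs.reshape(-1)

# Toy "sentence" of 6 words with 300-dimensional word embeddings.
rng = np.random.default_rng(0)
sentence = rng.normal(size=(6, 300))
embedding = dct_embed(sentence, k=4)
print(embedding.shape)  # (1200,) = k * dim, independent of sentence length
```

Note that with orthonormal scaling the zeroth DCT coefficient is the column sum divided by the square root of the sequence length, i.e. a scaled average, so `k = 1` essentially recovers vector averaging; the higher-order coefficients are what add order-sensitive information.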
We further validate the effectiveness of our DCT embeddings in multi- and cross-lingual settings. Specifically, we investigate the generality of the representations across languages that exhibit different linguistic properties, both as a language-independent model and as a cross-lingual model. We empirically show that the performance of the DCT embeddings is comparable across languages for all examined tasks. Moreover, in the cross-lingual setting, DCT embeddings yield superior performance in sentence translation retrieval compared to other state-of-the-art models across all language pairs. These results reaffirm the power of the structural properties encoded in the lower-order DCT coefficients, which are used to generate the final fixed-length sentence embeddings.
A major weakness of DCT, however, is its loss of deeper linguistic information: for example, embeddings of "man bitten by dog" rendered from the lower-order DCT coefficients are more similar to those of "man bites dog" than to those of the semantically equivalent "dog bites man". To address this deficiency, we propose to explicitly model linguistic information in the DCT framework using a block-based representation protocol. The blocks reflect various levels of linguistic representation, such as n-gram chunks, syntactic dependencies, and shallow semantic representations. Overall, our results show that augmenting the DCT encoding with block-based representations improves performance relative to the vanilla baseline (sentence-only encoding) on both probing and downstream classification tasks.
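The block-based augmentation can be sketched as follows. This is a heavily hedged illustration: the dissertation derives blocks from chunkers, dependency parses, and shallow semantic parses, whereas this sketch substitutes fixed-size n-gram windows as a stand-in; the function names (`dct_k`, `block_dct_embed`) and the averaging of block encodings are assumptions, not the authors' protocol.

```python
import numpy as np
from scipy.fft import dct

def dct_k(mat: np.ndarray, k: int) -> np.ndarray:
    """First k orthonormal DCT-II coefficient rows, zero-padded if needed,
    flattened into a fixed-length vector."""
    c = dct(mat, type=2, axis=0, norm="ortho")[:k]
    if c.shape[0] < k:
        c = np.vstack([c, np.zeros((k - c.shape[0], mat.shape[1]))])
    return c.reshape(-1)

def block_dct_embed(word_vectors: np.ndarray,
                    k: int = 2, block_size: int = 3) -> np.ndarray:
    """Hypothetical block-based encoding: concatenate the sentence-level DCT
    encoding with a pooled encoding of per-block DCTs. Blocks here are
    fixed n-gram windows, a stand-in for linguistically derived spans."""
    sent_vec = dct_k(word_vectors, k)
    blocks = [word_vectors[i:i + block_size]
              for i in range(0, len(word_vectors), block_size)]
    # Average the block-level encodings into one vector of the same size.
    block_vec = np.mean([dct_k(b, k) for b in blocks], axis=0)
    return np.concatenate([sent_vec, block_vec])

sent = np.random.default_rng(1).normal(size=(7, 50))
print(block_dct_embed(sent, k=2, block_size=3).shape)  # (200,) = 2 * k * dim
```

The design point this illustrates is that the block encodings expose structure below the sentence level (who did what to whom, in the dissertation's case) while the concatenation keeps the output fixed-length regardless of sentence length or block count.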
Advisor: Diab, Mona T.
Committee: Youssef, Abdou; Pless, Robert; Caliskan, Aylin; Stoyanov, Veselin
Keywords: Discrete cosine transform; Natural language processing; Sentence embeddings
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved