This dissertation applies convolutional networks for learning representations of text, and it consists of several parts. The first part offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks. These results indicate that using low-level inputs – in this case characters – for convolutional networks could be feasible for text representation learning. The second part concerns which text encoding method might work for convolutional networks. We include a comprehensive comparison of different encoding methods for the task of text classification using 14 large-scale datasets in 4 languages including Chinese, English, Japanese and Korean. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText and convolutional networks. For convolutional networks, we compare between encoding mechanisms using character glyph images, one-hot (or one-of-n) encoding, and embedding. From these 473 models, one of the conclusions is that byte-level one-hot encoding works consistently best for convolutional networks. Based on this, in the third part of the dissertation we develop a convolutional network at the level of bytes for learning representations through the task of auto-encoding. The proposed model is a multi-stage deep convolutional encoder-decoder framework using residual connections, containing up to 160 parameterized layers. Each encoder or decoder contains a shared group of modules that consists of either pooling or up-sampling layers, making the network recursive in terms of abstraction levels in representation. The decoding process is non-sequential. Results for 6 large-scale paragraph datasets are reported, in 3 languages including Arabic, Chinese and English. Analyses are conducted to study several properties of the proposed model. Experiments are presented to verify that the auto-encoder can learn useful representations. In the fourth part of the dissertation, we use the improved design from the previous auto-encoding model to text classification, adding comparisons between residual and dense connections. This further validates the choice of the architecture we made for the auto-encoding model, and the effectiveness of the recursive architecture with residual or dense connections.
|Commitee:||Bowman, Samuel R., Cho, Kyunghyun, Chopra, Sumit, Szlam, Arthur|
|School:||New York University|
|School Location:||United States -- New York|
|Source:||DAI-B 80/07(E), Dissertation Abstracts International|
|Keywords:||Auto-encoding, Convolutional networks, Text classification, Text encoding, Text representation|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
The supplemental file or files you are about to download were provided to ProQuest by the author as part of a
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
Copyright of the original materials contained in the supplemental file is retained by the author and your access to the
supplemental files is subject to the ProQuest Terms and Conditions of use.
Depending on the size of the file(s) you are downloading, the system may take some time to download them. Please be