Automatic Speech Recognition (ASR) has advanced to the point where state of the art speech recognition algorithms perform reasonably well even for large vocabulary continuous speech recognition in practical environments. Among speech recognition problems, feature extraction, which compresses a speech signal into streams of acoustical feature vectors, has become even more important for ASR since acoustical modeling methods have been well established and language modeling largely depends on the nature of the targeted language. The focus of this dissertation is the determination of effective speech features, where both spectral and temporal variations in speech are captured in a low dimensional representation, for speech recognition tasks.
In this dissertation, a set of spectral-temporal features, namely Discrete Cosine Transform Coefficients (DCTCs) and Discrete Cosine Series Coefficients (DCSCs), is examined for the purpose of capturing both the spectral and temporal variations in speech. Experimental evaluations showed that temporal variations are also of great importance for speech recognition, especially using a long time context.
Additionally, in order to reduce the limitations of the acoustical modeling based on Hidden Markov Models (HMMs), a neural network is utilized as a feature transformer to maximize the discrimination and lessen the correlation of the DCTC/DCSC features. The transformed features lead to a large improvement in the phoneme speech recognition based on the TIMIT database, especially when a small number of states and Gaussian mixtures are used for HMMs.
The neural network feature transforms are viewed as two types of Nonlinear Discriminant Analysis (NLDA) methods for nonlinear dimensionality reduction of speech features since high dimensional features considerably increase computation costs and greatly restrict performance improvement. The first method (NLDA1) uses the final outputs of the network to obtain dimensionality reduced features with the incorporation of the Principal Component Analysis (PCA) processing, while the second one (NLDA2) focuses on the middle layer outputs. The very high phone accuracy obtained with NLDA2 based on TIMIT database was 75.0% using a large number of network training iterations based on state-specific targets.
|Advisor:||Zahorian, Stephen A.|
|Commitee:||Fowler, Mark, Li, Xiaohua, Yin, Lijun|
|School:||State University of New York at Binghamton|
|School Location:||United States -- New York|
|Source:||DAI-B 71/11, Dissertation Abstracts International|
|Keywords:||Automatic speech recognition, Dimensionality reduction, Neural networks, Nonlinear discriminant analysis, Spectral-temporal features|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved
dissertation or thesis. The supplemental files are provided "AS IS" without warranty. ProQuest is not responsible for the
content, format or impact on the supplemental file(s) on our system. in some cases, the file type may be unknown or
may be a .exe file. We recommend caution as you open such files.
supplemental files is subject to the ProQuest Terms and Conditions of use.